# Home Credit Default Risk: Complete EDA,
# data preparation, data analysis
# and feature selection via data visualization techniques.
# Group 3 - DSEB 62:
# 1. Mai Xuan Bach
# 2. Tran Thi Hanh
# 3. Nguyen Hoang Long
# 4. Tong Le Khanh Nhi
We use Plotly to draw the pie chart. If you cannot see the pie chart, i.e., the output section is completely blank, your installed version of Plotly may be incompatible, or Plotly may not be installed at all. The fix is simple:
In the Anaconda Prompt:
Wait until this process ends, then reinstall it.
We hope these instructions work. Otherwise, please contact us!
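Before reinstalling, it can help to confirm whether Plotly is installed at all and which version is active. The snippet below is a minimal sketch that uses only the standard library; the helper name `installed_version` is our own and is not part of Plotly.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string, or None when Plotly is not installed
print("plotly version:", installed_version("plotly"))
```

If this prints `None`, Plotly is not installed in the environment that runs the notebook kernel, which would explain a blank output cell.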
We use the "jupyter-navbar" extension to get the navigation bar (table of contents) shown in the picture below. Please follow these simple instructions to set it up:
We are using a typical data science stack: numpy, pandas, matplotlib, seaborn, plotly.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Set option to see the full width of each column
pd.set_option("display.max_colwidth", None)
# Set option to see all columns of the dataframe
pd.set_option('display.max_columns', None)
There are 10 files in total: 1 main file for training (with the target), 1 main file for testing (without the target), 1 example submission file (not needed for our tasks), 1 file describing the columns of each CSV file, and 6 other files containing additional information about each loan.
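Before loading anything, a quick check that all ten files are in place can save a confusing `FileNotFoundError` later. This is a sketch under assumptions: the folder path matches the one used in the `read_csv` calls below, the name `sample_submission.csv` is assumed from the Kaggle download (it is not loaded in this notebook), and `find_missing_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumed file names; adjust if your copy of the data differs.
EXPECTED_FILES = [
    "application_train.csv",
    "application_test.csv",
    "sample_submission.csv",  # assumed name of the example submission file
    "HomeCredit_columns_description.csv",
    "bureau.csv",
    "bureau_balance.csv",
    "credit_card_balance.csv",
    "installments_payments.csv",
    "POS_CASH_balance.csv",
    "previous_application.csv",
]


def find_missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    folder = Path(data_dir)
    return [name for name in expected if not (folder / name).is_file()]


missing = find_missing_files("./home-credit-default-risk")
print("Missing files:", missing if missing else "none")
```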
application_train = pd.read_csv("./home-credit-default-risk/application_train.csv")
application_train.head()
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | 0.0252 | 0.0383 | 0.9722 | 0.6341 | 0.0144 | 0.0000 | 0.0690 | 0.0833 | 0.1250 | 0.0377 | 0.022 | 0.0198 | 0.0 | 0.0 | 0.0250 | 0.0369 | 0.9722 | 0.6243 | 0.0144 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0375 | 0.0205 | 0.0193 | 0.0000 | 0.00 | reg oper account | block of flats | 0.0149 | Stone, brick | No | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | 0.0924 | 0.0538 | 0.9851 | 0.8040 | 0.0497 | 0.0806 | 0.0345 | 0.2917 | 0.3333 | 0.0128 | 0.079 | 0.0554 | 0.0 | 0.0 | 0.0968 | 0.0529 | 0.9851 | 0.7987 | 0.0608 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0132 | 0.0787 | 0.0558 | 0.0039 | 0.01 | reg oper account | block of flats | 0.0714 | Block | No | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 26.0 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.0 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
application_test = pd.read_csv("./home-credit-default-risk/application_test.csv")
application_test.head()
| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | Unaccompanied | Working | Higher education | Married | House / apartment | 0.018850 | -19241 | -2329 | -5170.0 | -812 | NaN | 1 | 1 | 0 | 1 | 0 | 1 | NaN | 2.0 | 2 | 2 | TUESDAY | 18 | 0 | 0 | 0 | 0 | 0 | 0 | Kindergarten | 0.752614 | 0.789654 | 0.159520 | 0.0660 | 0.0590 | 0.9732 | NaN | NaN | NaN | 0.1379 | 0.125 | NaN | NaN | NaN | 0.0505 | NaN | NaN | 0.0672 | 0.0612 | 0.9732 | NaN | NaN | NaN | 0.1379 | 0.125 | NaN | NaN | NaN | 0.0526 | NaN | NaN | 0.0666 | 0.0590 | 0.9732 | NaN | NaN | NaN | 0.1379 | 0.125 | NaN | NaN | NaN | 0.0514 | NaN | NaN | NaN | block of flats | 0.0392 | Stone, brick | No | 0.0 | 0.0 | 0.0 | 0.0 | -1740.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.035792 | -18064 | -4469 | -9118.0 | -1623 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Low-skill Laborers | 2.0 | 2 | 2 | FRIDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Self-employed | 0.564990 | 0.291656 | 0.432962 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | NaN | Working | Higher education | Married | House / apartment | 0.019101 | -20038 | -4458 | -2175.0 | -3503 | 5.0 | 1 | 1 | 0 | 1 | 0 | 0 | Drivers | 2.0 | 2 | 2 | MONDAY | 14 | 0 | 0 | 0 | 0 | 0 | 0 | Transport: type 3 | NaN | 0.699787 | 0.610991 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -856.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.026392 | -13976 | -1866 | -2000.0 | -4208 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Sales staff | 4.0 | 2 | 2 | WEDNESDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.525734 | 0.509677 | 0.612704 | 0.3052 | 0.1974 | 0.9970 | 0.9592 | 0.1165 | 0.32 | 0.2759 | 0.375 | 0.0417 | 0.2042 | 0.2404 | 0.3673 | 0.0386 | 0.08 | 0.3109 | 0.2049 | 0.9970 | 0.9608 | 0.1176 | 0.3222 | 0.2759 | 0.375 | 0.0417 | 0.2089 | 0.2626 | 0.3827 | 0.0389 | 0.0847 | 0.3081 | 0.1974 | 0.9970 | 0.9597 | 0.1173 | 0.32 | 0.2759 | 0.375 | 0.0417 | 0.2078 | 0.2446 | 0.3739 | 0.0388 | 0.0817 | reg oper account | block of flats | 0.3700 | Panel | No | 0.0 | 0.0 | 0.0 | 0.0 | -1805.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.010032 | -13040 | -2191 | -4000.0 | -4262 | 16.0 | 1 | 1 | 1 | 1 | 0 | 0 | NaN | 3.0 | 2 | 2 | FRIDAY | 5 | 0 | 0 | 0 | 0 | 1 | 1 | Business Entity Type 3 | 0.202145 | 0.425687 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -821.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
bureau = pd.read_csv("./home-credit-default-risk/bureau.csv")
bureau.head()
| | SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
| 1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
| 2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
| 3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
| 4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
bureau_balance = pd.read_csv("./home-credit-default-risk/bureau_balance.csv")
bureau_balance.head()
| | SK_ID_BUREAU | MONTHS_BALANCE | STATUS |
|---|---|---|---|
| 0 | 5715448 | 0 | C |
| 1 | 5715448 | -1 | C |
| 2 | 5715448 | -2 | C |
| 3 | 5715448 | -3 | C |
| 4 | 5715448 | -4 | C |
credit_card_balance = pd.read_csv("./home-credit-default-risk/credit_card_balance.csv")
credit_card_balance.head()
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | AMT_PAYMENT_CURRENT | AMT_PAYMENT_TOTAL_CURRENT | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | -6 | 56.970 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.325 | 1800.0 | 1800.0 | 0.000 | 0.000 | 0.000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 |
| 1 | 2582071 | 363914 | -1 | 63975.555 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000 | 2250.0 | 2250.0 | 60175.080 | 64875.555 | 64875.555 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 |
| 2 | 1740877 | 371185 | -7 | 31815.225 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000 | 2250.0 | 2250.0 | 26926.425 | 31460.085 | 31460.085 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 |
| 3 | 1389973 | 337855 | -4 | 236572.110 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.760 | 11925.0 | 11925.0 | 224949.285 | 233048.970 | 233048.970 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 |
| 4 | 1891521 | 126868 | -1 | 453919.455 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890 | 27000.0 | 27000.0 | 443044.395 | 453919.455 | 453919.455 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 |
installments_payments = pd.read_csv("./home-credit-default-risk/installments_payments.csv")
installments_payments.head()
| | SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT |
|---|---|---|---|---|---|---|---|---|
| 0 | 1054186 | 161674 | 1.0 | 6 | -1180.0 | -1187.0 | 6948.360 | 6948.360 |
| 1 | 1330831 | 151639 | 0.0 | 34 | -2156.0 | -2156.0 | 1716.525 | 1716.525 |
| 2 | 2085231 | 193053 | 2.0 | 1 | -63.0 | -63.0 | 25425.000 | 25425.000 |
| 3 | 2452527 | 199697 | 1.0 | 3 | -2418.0 | -2426.0 | 24350.130 | 24350.130 |
| 4 | 2714724 | 167756 | 1.0 | 2 | -1383.0 | -1366.0 | 2165.040 | 2160.585 |
POS_CASH_balance = pd.read_csv("./home-credit-default-risk/POS_CASH_balance.csv")
POS_CASH_balance.head()
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|
| 0 | 1803195 | 182943 | -31 | 48.0 | 45.0 | Active | 0 | 0 |
| 1 | 1715348 | 367990 | -33 | 36.0 | 35.0 | Active | 0 | 0 |
| 2 | 1784872 | 397406 | -32 | 12.0 | 9.0 | Active | 0 | 0 |
| 3 | 1903291 | 269225 | -35 | 48.0 | 42.0 | Active | 0 | 0 |
| 4 | 2341044 | 334279 | -35 | 36.0 | 35.0 | Active | 0 | 0 |
previous_application = pd.read_csv("./home-credit-default-risk/previous_application.csv")
previous_application.head()
| | SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | Y | 1 | 0.0 | 0.182832 | 0.867336 | XAP | Approved | -73 | Cash through the bank | XAP | NaN | Repeater | Mobile | POS | XNA | Country-wide | 35 | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -164 | XNA | XAP | Unaccompanied | Repeater | XNA | Cash | x-sell | Contact center | -1 | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -301 | Cash through the bank | XAP | Spouse, partner | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -512 | Cash through the bank | XAP | NaN | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | Y | 1 | NaN | NaN | NaN | Repairs | Refused | -781 | Cash through the bank | HC | NaN | Repeater | XNA | Cash | walk-in | Credit and cash offices | -1 | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
HomeCredit_columns_description = pd.read_csv("./home-credit-default-risk/HomeCredit_columns_description.csv", encoding = "ISO-8859-1")
HomeCredit_columns_description.head()
| | Unnamed: 0 | Table | Row | Description | Special |
|---|---|---|---|---|---|
| 0 | 1 | application_{train\|test}.csv | SK_ID_CURR | ID of loan in our sample | NaN |
| 1 | 2 | application_{train\|test}.csv | TARGET | Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases) | NaN |
| 2 | 5 | application_{train\|test}.csv | NAME_CONTRACT_TYPE | Identification if loan is cash or revolving | NaN |
| 3 | 6 | application_{train\|test}.csv | CODE_GENDER | Gender of the client | NaN |
| 4 | 7 | application_{train\|test}.csv | FLAG_OWN_CAR | Flag if the client owns a car | NaN |
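The description file is most useful as a lookup table while exploring the other files. The helper below is a minimal sketch: `describe_column` is a hypothetical name, and it assumes the `Table`, `Row`, and `Description` columns shown in the output above. The small `sample` frame only stands in for the real file.

```python
import pandas as pd


def describe_column(description_df, column_name):
    """Return the description rows for one column name (hypothetical helper)."""
    matches = description_df[description_df["Row"] == column_name]
    return matches[["Table", "Row", "Description"]]


# Stand-in built from two rows shown above; on the real data,
# pass HomeCredit_columns_description instead.
sample = pd.DataFrame({
    "Table": ["application_{train|test}.csv"] * 2,
    "Row": ["SK_ID_CURR", "CODE_GENDER"],
    "Description": ["ID of loan in our sample", "Gender of the client"],
})
print(describe_column(sample, "CODE_GENDER"))
```

A column name can appear in more than one table, so the helper returns every matching row rather than a single string.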
Exploratory Data Analysis (EDA) is an open-ended process in which we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data. The goal of EDA is to learn what our data can tell us. It generally starts with a high-level overview, then narrows in on specific areas as we find something intriguing. The findings may be interesting in their own right, or they can help us decide which features to use.
This table is static data for all applications. One row represents one loan in our data sample.
application_train.shape
(307511, 122)
The target is what we are asked to predict: either 0, meaning the loan was repaid on time, or 1, meaning the client had payment difficulties. We can first examine the number of loans falling into each category.
application_train['TARGET'].value_counts()
0    282686
1     24825
Name: TARGET, dtype: int64
application_train['TARGET'].astype(int).plot.hist();
From this information, we see this is an imbalanced class problem: 282,686 loans were repaid on time, while only 24,825 (about 8% of the 307,511 loans) were not.
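To put a number on the imbalance, the class shares and the ratio between the two classes can be computed directly from the counts. This is a small sketch that rebuilds the counts by hand from the output above; the variable names are our own.

```python
import pandas as pd

# Counts copied from the value_counts() output above
target_counts = pd.Series({0: 282686, 1: 24825}, name="TARGET")

# Fraction of loans in each class
class_share = target_counts / target_counts.sum()
# Number of repaid loans for every loan with payment difficulties
imbalance_ratio = target_counts.loc[0] / target_counts.loc[1]

print(class_share.round(4))
print(f"Imbalance ratio: {imbalance_ratio:.1f} to 1")
```

On the full dataframe, the same shares can be obtained with `application_train['TARGET'].value_counts(normalize=True)`.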
Next we can look at the number and percentage of missing values in each column.
# Function to calculate missing values by column
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()
    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    # Keep only columns with missing values, sorted by percentage missing (descending)
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    # Return the dataframe with missing information
    return mis_val_table_ren_columns
# Missing values statistics
missing_values = missing_values_table(application_train)
missing_values.head(20)
Your selected dataframe has 122 columns.
There are 67 columns that have missing values.
| | Missing Values | % of Total Values |
|---|---|---|
| COMMONAREA_MEDI | 214865 | 69.9 |
| COMMONAREA_AVG | 214865 | 69.9 |
| COMMONAREA_MODE | 214865 | 69.9 |
| NONLIVINGAPARTMENTS_MEDI | 213514 | 69.4 |
| NONLIVINGAPARTMENTS_MODE | 213514 | 69.4 |
| NONLIVINGAPARTMENTS_AVG | 213514 | 69.4 |
| FONDKAPREMONT_MODE | 210295 | 68.4 |
| LIVINGAPARTMENTS_MODE | 210199 | 68.4 |
| LIVINGAPARTMENTS_MEDI | 210199 | 68.4 |
| LIVINGAPARTMENTS_AVG | 210199 | 68.4 |
| FLOORSMIN_MODE | 208642 | 67.8 |
| FLOORSMIN_MEDI | 208642 | 67.8 |
| FLOORSMIN_AVG | 208642 | 67.8 |
| YEARS_BUILD_MODE | 204488 | 66.5 |
| YEARS_BUILD_MEDI | 204488 | 66.5 |
| YEARS_BUILD_AVG | 204488 | 66.5 |
| OWN_CAR_AGE | 202929 | 66.0 |
| LANDAREA_AVG | 182590 | 59.4 |
| LANDAREA_MEDI | 182590 | 59.4 |
| LANDAREA_MODE | 182590 | 59.4 |
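The same kind of summary can be written more compactly, because the mean of a boolean mask is the fraction of missing entries. The sketch below is an alternative in the same spirit, not the function used above; `missing_summary` and the toy frame with made-up values are our own.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    summary = pd.DataFrame({
        "Missing Values": df.isna().sum(),
        "% of Total Values": (100 * df.isna().mean()).round(1),
    })
    # Keep only columns with at least one missing value
    summary = summary[summary["Missing Values"] > 0]
    return summary.sort_values("% of Total Values", ascending=False)


# Toy frame (made-up values) to show the shape of the result
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1, 2, 3, 4],
    "c": ["x", None, "z", "w"],
})
print(missing_summary(toy))
```

Column `b` has no missing values, so it is dropped from the result, just as fully populated columns are dropped from the table above.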
Next we can look at the number of unique values in each column, together with the values themselves.
def unique_values_table(df):
    columns = []
    columns_nunique = []
    columns_unique_values = []
    # Number of unique values
    for col in df:
        columns.append(col)
        columns_nunique.append(df[col].nunique())
    # Unique values
    for each in df.columns:
        columns_unique_values.append(df[each].unique())
    # Build a dataframe that holds both pieces of information
    unique_table = pd.DataFrame(list(zip(columns_nunique, columns_unique_values)),
                                columns=['number of unique values', 'unique values'], index=columns)
    return unique_table
unique_values_table(application_train)
| | number of unique values | unique values |
|---|---|---|
| SK_ID_CURR | 307511 | [100002, 100003, 100004, 100006, 100007, 100008, 100009, 100010, 100011, 100012, 100014, 100015, 100016, 100017, 100018, 100019, 100020, 100021, 100022, 100023, 100024, 100025, 100026, 100027, 100029, 100030, 100031, 100032, 100033, 100034, 100035, 100036, 100037, 100039, 100040, 100041, 100043, 100044, 100045, 100046, 100047, 100048, 100049, 100050, 100051, 100052, 100053, 100054, 100055, 100056, 100058, 100059, 100060, 100061, 100062, 100063, 100064, 100068, 100069, 100070, 100071, 100072, 100073, 100075, 100076, 100077, 100078, 100079, 100080, 100081, 100082, 100083, 100084, 100085, 100086, 100087, 100088, 100089, 100093, 100094, 100095, 100096, 100097, 100098, 100099, 100100, 100101, 100102, 100103, 100104, 100105, 100108, 100110, 100111, 100112, 100113, 100114, 100115, 100116, 100118, ...] |
| TARGET | 2 | [1, 0] |
| NAME_CONTRACT_TYPE | 2 | [Cash loans, Revolving loans] |
| CODE_GENDER | 3 | [M, F, XNA] |
| FLAG_OWN_CAR | 2 | [N, Y] |
| ... | ... | ... |
| AMT_REQ_CREDIT_BUREAU_DAY | 9 | [0.0, nan, 1.0, 3.0, 2.0, 4.0, 5.0, 6.0, 9.0, 8.0] |
| AMT_REQ_CREDIT_BUREAU_WEEK | 9 | [0.0, nan, 1.0, 3.0, 2.0, 4.0, 5.0, 6.0, 8.0, 7.0] |
| AMT_REQ_CREDIT_BUREAU_MON | 24 | [0.0, nan, 1.0, 2.0, 6.0, 5.0, 3.0, 7.0, 9.0, 4.0, 11.0, 8.0, 16.0, 12.0, 14.0, 10.0, 13.0, 17.0, 24.0, 19.0, 15.0, 23.0, 18.0, 27.0, 22.0] |
| AMT_REQ_CREDIT_BUREAU_QRT | 11 | [0.0, nan, 1.0, 2.0, 4.0, 3.0, 8.0, 5.0, 6.0, 7.0, 261.0, 19.0] |
| AMT_REQ_CREDIT_BUREAU_YEAR | 25 | [1.0, 0.0, nan, 2.0, 4.0, 5.0, 3.0, 8.0, 6.0, 9.0, 7.0, 10.0, 11.0, 13.0, 16.0, 12.0, 25.0, 23.0, 15.0, 14.0, 22.0, 17.0, 19.0, 18.0, 21.0, 20.0] |
122 rows × 2 columns
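One detail in the table above is worth spelling out: for columns with missing data, the count is one less than the length of the listed array (for example, `AMT_REQ_CREDIT_BUREAU_DAY` reports 9 unique values but lists 10 entries). That is because `nunique()` ignores `NaN` by default while `unique()` keeps it. A minimal illustration with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up values mimicking a column with missing entries
s = pd.Series([0.0, np.nan, 1.0, 1.0, np.nan])

n_without_nan = s.nunique()             # distinct non-missing values
n_with_nan = s.nunique(dropna=False)    # counts NaN as one extra value
n_listed = len(s.unique())              # unique() keeps NaN in its result

print(n_without_nan, n_with_nan, n_listed)
```

Passing `dropna=False` to `nunique()` makes the two numbers agree, if that is preferred.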
Next we describe some basic statistics of the numeric columns: count, mean, standard deviation, minimum, quartiles, and maximum.
# Numeric columns
application_train.describe()
| | SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | 3.072330e+05 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 104582.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307509.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 134133.000000 | 3.068510e+05 | 246546.000000 | 151450.00000 | 127568.000000 | 157504.000000 | 103023.000000 | 92646.000000 | 143620.000000 | 152683.000000 | 154491.000000 | 98869.000000 | 124921.000000 | 97312.000000 | 153161.000000 | 93997.000000 | 137829.000000 | 151450.000000 | 127568.000000 | 157504.000000 | 103023.000000 | 92646.000000 | 143620.000000 | 152683.000000 | 154491.000000 | 98869.000000 | 124921.000000 | 97312.000000 | 153161.000000 | 93997.000000 | 137829.000000 | 151450.000000 | 127568.000000 | 157504.000000 | 103023.000000 | 92646.000000 | 143620.000000 | 152683.000000 | 154491.000000 | 98869.000000 | 124921.000000 | 97312.000000 | 153161.000000 | 93997.000000 | 137829.000000 | 159080.000000 | 306490.000000 | 306490.000000 | 306490.000000 | 306490.000000 | 307510.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.00000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| mean | 278180.518577 | 0.080729 | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | 5.383962e+05 | 0.020868 | -16036.995067 | 63815.045904 | -4986.120328 | -2994.202373 | 12.061091 | 0.999997 | 0.819889 | 0.199368 | 0.998133 | 0.281066 | 0.056720 | 2.152665 | 2.052463 | 2.031521 | 12.063419 | 0.015144 | 0.050769 | 0.040659 | 0.078173 | 0.230454 | 0.179555 | 0.502130 | 5.143927e-01 | 0.510853 | 0.11744 | 0.088442 | 0.977735 | 0.752471 | 0.044621 | 0.078942 | 0.149725 | 0.226282 | 0.231894 | 0.066333 | 0.100775 | 0.107399 | 0.008809 | 0.028358 | 0.114231 | 0.087543 | 0.977065 | 0.759637 | 0.042553 | 0.074490 | 0.145193 | 0.222315 | 0.228058 | 0.064958 | 0.105645 | 0.105975 | 0.008076 | 0.027022 | 0.117850 | 0.087955 | 0.977752 | 0.755746 | 0.044595 | 0.078078 | 0.149213 | 0.225897 | 0.231625 | 0.067169 | 0.101954 | 0.108607 | 0.008651 | 0.028236 | 0.102547 | 1.422245 | 0.143421 | 1.405292 | 0.100049 | -962.858788 | 0.000042 | 0.710023 | 0.000081 | 0.015115 | 0.088055 | 0.000192 | 0.081376 | 0.003896 | 0.000023 | 0.003912 | 0.000007 | 0.003525 | 0.002936 | 0.00121 | 0.009928 | 0.000267 | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | 3.694465e+05 | 0.013831 | 4363.988632 | 141275.766519 | 3522.886321 | 1509.450419 | 11.944812 | 0.001803 | 0.384280 | 0.399526 | 0.043164 | 0.449521 | 0.231307 | 0.910682 | 0.509034 | 0.502737 | 3.265832 | 0.122126 | 0.219526 | 0.197499 | 0.268444 | 0.421124 | 0.383817 | 0.211062 | 1.910602e-01 | 0.194844 | 0.10824 | 0.082438 | 0.059223 | 0.113280 | 0.076036 | 0.134576 | 0.100049 | 0.144641 | 0.161380 | 0.081184 | 0.092576 | 0.110565 | 0.047732 | 0.069523 | 0.107936 | 0.084307 | 0.064575 | 0.110111 | 0.074445 | 0.132256 | 0.100977 | 0.143709 | 0.161160 | 0.081750 | 0.097880 | 0.111845 | 0.046276 | 0.070254 | 0.109076 | 0.082179 | 0.059897 | 0.112066 | 0.076144 | 0.134467 | 0.100368 | 0.145067 | 0.161934 | 0.082167 | 0.093642 | 0.112260 | 0.047415 | 0.070166 | 0.107462 | 2.400989 | 0.446698 | 2.379803 | 0.362291 | 826.808487 | 0.006502 | 0.453752 | 0.009016 | 0.122010 | 0.283376 | 0.013850 | 0.273412 | 0.062295 | 0.004771 | 0.062424 | 0.002550 | 0.059268 | 0.054110 | 0.03476 | 0.099144 | 0.016327 | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | 4.050000e+04 | 0.000290 | -25229.000000 | -17912.000000 | -24672.000000 | -7197.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.014568 | 8.173617e-08 | 0.000527 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -4292.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | 2.385000e+05 | 0.010006 | -19682.000000 | -2760.000000 | -7479.500000 | -4299.000000 | 5.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 10.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.334007 | 3.924574e-01 | 0.370650 | 0.05770 | 0.044200 | 0.976700 | 0.687200 | 0.007800 | 0.000000 | 0.069000 | 0.166700 | 0.083300 | 0.018700 | 0.050400 | 0.045300 | 0.000000 | 0.000000 | 0.052500 | 0.040700 | 0.976700 | 0.699400 | 0.007200 | 0.000000 | 0.069000 | 0.166700 | 0.083300 | 0.016600 | 0.054200 | 0.042700 | 0.000000 | 0.000000 | 0.058300 | 0.043700 | 0.976700 | 0.691400 | 0.007900 | 0.000000 | 0.069000 | 0.166700 | 0.083300 | 0.018700 | 0.051300 | 0.045700 | 0.000000 | 0.000000 | 0.041200 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1570.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | 4.500000e+05 | 0.018850 | -15750.000000 | -1213.000000 | -4504.000000 | -3254.000000 | 9.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 12.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.505998 | 5.659614e-01 | 0.535276 | 0.08760 | 0.076300 | 0.981600 | 0.755200 | 0.021100 | 0.000000 | 0.137900 | 0.166700 | 0.208300 | 0.048100 | 0.075600 | 0.074500 | 0.000000 | 0.003600 | 0.084000 | 0.074600 | 0.981600 | 0.764800 | 0.019000 | 0.000000 | 0.137900 | 0.166700 | 0.208300 | 0.045800 | 0.077100 | 0.073100 | 0.000000 | 0.001100 | 0.086400 | 0.075800 | 0.981600 | 0.758500 | 0.020800 | 0.000000 | 0.137900 | 0.166700 | 0.208300 | 0.048700 | 0.076100 | 0.074900 | 0.000000 | 0.003100 | 0.068800 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -757.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | 6.795000e+05 | 0.028663 | -12413.000000 | -289.000000 | -2010.000000 | -1720.000000 | 15.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 3.000000 | 2.000000 | 2.000000 | 14.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.675053 | 6.636171e-01 | 0.669057 | 0.14850 | 0.112200 | 0.986600 | 0.823200 | 0.051500 | 0.120000 | 0.206900 | 0.333300 | 0.375000 | 0.085600 | 0.121000 | 0.129900 | 0.003900 | 0.027700 | 0.143900 | 0.112400 | 0.986600 | 0.823600 | 0.049000 | 0.120800 | 0.206900 | 0.333300 | 0.375000 | 0.084100 | 0.131300 | 0.125200 | 0.003900 | 0.023100 | 0.148900 | 0.111600 | 0.986600 | 0.825600 | 0.051300 | 0.120000 | 0.206900 | 0.333300 | 0.375000 | 0.086800 | 0.123100 | 0.130300 | 0.003900 | 0.026600 | 0.127600 | 2.000000 | 0.000000 | 2.000000 | 0.000000 | -274.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | 4.050000e+06 | 0.072508 | -7489.000000 | 365243.000000 | 0.000000 | 0.000000 | 91.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 20.000000 | 3.000000 | 3.000000 | 23.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.962693 | 8.549997e-01 | 0.896010 | 1.00000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 348.000000 | 34.000000 | 344.000000 | 24.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
# Object/categorical columns
application_train.describe(include=["object", "category"])
| | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | OCCUPATION_TYPE | WEEKDAY_APPR_PROCESS_START | ORGANIZATION_TYPE | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511 | 307511 | 307511 | 307511 | 306219 | 307511 | 307511 | 307511 | 307511 | 211120 | 307511 | 307511 | 97216 | 153214 | 151170 | 161756 |
| unique | 2 | 3 | 2 | 2 | 7 | 8 | 5 | 6 | 6 | 18 | 7 | 58 | 4 | 3 | 7 | 2 |
| top | Cash loans | F | N | Y | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | Laborers | TUESDAY | Business Entity Type 3 | reg oper account | block of flats | Panel | No |
| freq | 278232 | 202448 | 202924 | 213312 | 248526 | 158774 | 218391 | 196432 | 272868 | 55186 | 53901 | 67992 | 73830 | 150503 | 66040 | 159428 |
A frequency distribution shows how often each distinct value occurs in a set of data. A histogram is the most commonly used graph for showing frequency distributions.
We make use of the histogram for each column in each dataframe provided by Kaggle: https://www.kaggle.com/competitions/home-credit-default-risk/data
Using box plots, we can easily spot the outliers in each numeric column of the dataset.
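def box_plot(df):
    # Draw one box plot per non-object column, stacked vertically
    columns = df.select_dtypes(exclude="object").columns
    # squeeze=False keeps axs two-dimensional even when there is a single column
    fig, axs = plt.subplots(len(columns), 1, figsize=(5, 5 * len(columns)), squeeze=False)
    for i, column_name in enumerate(columns):
        # Pass the column by keyword; recent seaborn versions reject it as a positional argument
        sns.boxplot(x=column_name, data=df, ax=axs[i, 0])
box_plot(application_train)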
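Box plots flag a point as an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. As a complement to the plots, the sketch below counts such points per numeric column. The helper name `count_iqr_outliers` and the toy dataframe are ours for illustration only; they are not part of the original notebook.

```python
import pandas as pd

def count_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the dataframe columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum().sort_values(ascending=False)

# Toy data (hypothetical values, not taken from the competition files)
toy = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

Calling `count_iqr_outliers(application_train)` should give a compact numeric summary of what the box plots show visually.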
The correlation coefficient is a statistical measure of the strength of the linear relationship between two variables. Its value ranges from -1 to 1. Some general interpretations of the absolute value of the correlation coefficient are:
.00-.19 “very weak”
.20-.39 “weak”
.40-.59 “moderate”
.60-.79 “strong”
.80-1.0 “very strong”
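The scale above can be written as a small lookup function. This is only a sketch: the function name `correlation_strength` is hypothetical, and placing the cut-offs exactly at 0.20, 0.40, 0.60 and 0.80 is our assumption, since the listed ranges leave small gaps between bands.

```python
def correlation_strength(r: float) -> str:
    """Map a correlation coefficient to the verbal scale listed above."""
    a = abs(r)
    if a > 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if a < 0.20:
        return "very weak"
    if a < 0.40:
        return "weak"
    if a < 0.60:
        return "moderate"
    if a < 0.80:
        return "strong"
    return "very strong"

# Only the absolute value matters, so the sign is ignored
print(correlation_strength(-0.25))
```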
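# Restrict to numeric columns: recent pandas versions raise an error on object columns
application_train_correlation = application_train.select_dtypes(include="number").corr()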
application_train_correlation.head()
| | SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 1.000000 | -0.002108 | -0.001129 | -0.001820 | -0.000343 | -0.000433 | -0.000232 | 0.000849 | -0.001500 | 0.001366 | -0.000973 | -0.000384 | 0.001818 | 0.002804 | -0.001337 | -0.000415 | 0.002815 | 0.002753 | 0.000281 | -0.002895 | -0.001075 | -0.001138 | 0.000350 | -0.000283 | 0.001097 | 0.002903 | -0.001885 | -0.001582 | 0.000067 | 0.000082 | 0.002342 | 0.000222 | 0.001556 | -0.002070 | 0.001551 | 0.005900 | -0.001463 | 0.004862 | -0.002879 | 0.004851 | 0.003083 | 0.001465 | 0.003119 | 0.001770 | -0.002575 | 0.003042 | 0.001961 | -0.001411 | 0.001890 | 0.005245 | -0.001058 | 0.005017 | -0.002844 | 0.004386 | 0.002081 | 0.001548 | 0.003589 | 0.002156 | -0.001923 | 0.001920 | 0.001988 | -0.001647 | 0.001366 | 0.005777 | -0.001036 | 0.005067 | -0.002583 | 0.004588 | 0.002837 | 0.001699 | 0.003272 | 0.002205 | -0.003020 | 0.002440 | 0.002288 | -0.001409 | -0.000082 | -0.001423 | 0.001187 | -0.000858 | 0.000700 | -0.003411 | -0.004139 | -0.001097 | 0.002121 | -0.002694 | 0.001809 | 0.001505 | -0.000815 | -0.002012 | -0.001045 | 0.000896 | -0.001077 | 0.002604 | -0.000724 | 0.001450 | 0.000509 | 0.000167 | 0.001073 | 0.000282 | -0.002672 | -0.002193 | 0.002099 | 0.000485 | 0.001025 | 0.004659 |
| TARGET | -0.002108 | 1.000000 | 0.019187 | -0.003982 | -0.030369 | -0.012817 | -0.039645 | -0.037227 | 0.078239 | -0.044932 | 0.041975 | 0.051457 | 0.037612 | 0.000534 | 0.045982 | 0.028524 | 0.000370 | -0.023806 | -0.001758 | 0.009308 | 0.058899 | 0.060893 | -0.024166 | 0.005576 | 0.006942 | 0.002819 | 0.044395 | 0.050994 | 0.032518 | -0.155317 | -0.160472 | -0.178919 | -0.029498 | -0.022746 | -0.009728 | -0.022149 | -0.018550 | -0.034199 | -0.019172 | -0.044003 | -0.033614 | -0.010885 | -0.025031 | -0.032997 | -0.003176 | -0.013578 | -0.027284 | -0.019952 | -0.009036 | -0.022068 | -0.016340 | -0.032131 | -0.017387 | -0.043226 | -0.032698 | -0.010174 | -0.023393 | -0.030685 | -0.001557 | -0.012711 | -0.029184 | -0.022081 | -0.009993 | -0.022326 | -0.018573 | -0.033863 | -0.019025 | -0.043768 | -0.033394 | -0.011256 | -0.024621 | -0.032739 | -0.002757 | -0.013337 | -0.032596 | 0.009131 | 0.032248 | 0.009022 | 0.031276 | 0.055218 | 0.005417 | 0.044346 | -0.002672 | -0.000316 | -0.028602 | -0.001520 | -0.008040 | -0.004352 | -0.001414 | -0.004229 | -0.000756 | -0.011583 | -0.009464 | -0.006536 | -0.011615 | -0.003378 | -0.007952 | -0.001358 | 0.000215 | 0.003709 | 0.000930 | 0.002704 | 0.000788 | -0.012462 | -0.002022 | 0.019930 |
| CNT_CHILDREN | -0.001129 | 0.019187 | 1.000000 | 0.012882 | 0.002145 | 0.021374 | -0.001827 | -0.025573 | 0.330938 | -0.239818 | 0.183395 | -0.028019 | 0.008494 | 0.001041 | 0.240714 | 0.055630 | -0.000794 | -0.029906 | 0.022619 | 0.879161 | 0.025423 | 0.024781 | -0.007292 | -0.013319 | 0.008185 | 0.014835 | 0.020072 | 0.070650 | 0.069957 | -0.138470 | -0.018015 | -0.042710 | -0.013222 | -0.008464 | 0.006902 | 0.030172 | 0.000140 | -0.007060 | -0.008341 | -0.009705 | -0.008753 | -0.003121 | -0.008648 | -0.010116 | 0.004051 | 0.000028 | -0.012105 | -0.008513 | 0.006199 | 0.029549 | 0.000440 | -0.006397 | -0.006880 | -0.009550 | -0.008044 | -0.002212 | -0.007955 | -0.009517 | 0.004077 | 0.000231 | -0.012985 | -0.008799 | 0.006401 | 0.030124 | 0.000609 | -0.006747 | -0.008325 | -0.009447 | -0.008231 | -0.002820 | -0.007962 | -0.010067 | 0.004133 | 0.000061 | -0.008037 | 0.015593 | -0.001262 | 0.015232 | -0.001861 | -0.005865 | 0.001786 | 0.056837 | -0.003709 | -0.016737 | -0.157024 | -0.001498 | 0.051697 | -0.001997 | -0.002756 | -0.005318 | 0.000293 | 0.003945 | -0.005459 | 0.003609 | 0.010662 | 0.000773 | 0.004031 | 0.000864 | 0.000988 | -0.002450 | -0.000410 | -0.000366 | -0.002436 | -0.010808 | -0.007836 | -0.041550 |
| AMT_INCOME_TOTAL | -0.001820 | -0.003982 | 0.012882 | 1.000000 | 0.156870 | 0.191657 | 0.159610 | 0.074796 | 0.027261 | -0.064223 | 0.027805 | 0.008506 | -0.117273 | 0.000325 | 0.063994 | -0.017193 | -0.008290 | 0.000159 | 0.038378 | 0.016342 | -0.085465 | -0.091735 | 0.036459 | 0.031191 | 0.062340 | 0.058059 | 0.003574 | 0.006431 | 0.008285 | 0.026232 | 0.060925 | -0.030229 | 0.034501 | 0.017303 | 0.005658 | 0.042334 | 0.089616 | 0.045053 | 0.005394 | 0.060171 | 0.139860 | -0.001598 | 0.106920 | 0.039976 | 0.029520 | 0.074604 | 0.029994 | 0.012821 | 0.005284 | 0.037299 | 0.075625 | 0.041032 | 0.002027 | 0.057675 | 0.131800 | -0.003674 | 0.092991 | 0.034915 | 0.025020 | 0.061778 | 0.033798 | 0.016381 | 0.005639 | 0.042004 | 0.087918 | 0.044160 | 0.004787 | 0.059682 | 0.138489 | -0.001892 | 0.104914 | 0.039261 | 0.028098 | 0.070844 | 0.041985 | -0.013099 | -0.013244 | -0.013015 | -0.013135 | -0.018585 | -0.001000 | -0.016751 | 0.000529 | 0.001507 | -0.045878 | 0.003825 | 0.072451 | 0.018389 | 0.000290 | 0.002315 | 0.002540 | 0.022747 | 0.020708 | 0.010793 | 0.007269 | 0.002230 | 0.003130 | 0.002408 | 0.000242 | -0.000589 | 0.000709 | 0.002944 | 0.002387 | 0.024700 | 0.004859 | 0.011690 |
| AMT_CREDIT | -0.000343 | -0.030369 | 0.002145 | 0.156870 | 1.000000 | 0.770138 | 0.986968 | 0.099738 | -0.055436 | -0.066838 | 0.009621 | -0.006575 | -0.094191 | 0.001436 | 0.065519 | -0.021085 | 0.023653 | 0.026213 | 0.016632 | 0.063160 | -0.101776 | -0.110915 | 0.052738 | 0.024010 | 0.051929 | 0.052609 | -0.026886 | -0.018856 | 0.000081 | 0.168429 | 0.131228 | 0.043516 | 0.060439 | 0.039226 | 0.006249 | 0.035875 | 0.049537 | 0.080635 | 0.014929 | 0.103296 | 0.078832 | 0.006218 | 0.058788 | 0.072146 | 0.014362 | 0.037885 | 0.053072 | 0.031213 | 0.004804 | 0.033478 | 0.042341 | 0.074740 | 0.009361 | 0.100418 | 0.075485 | 0.002532 | 0.051208 | 0.064142 | 0.011106 | 0.032390 | 0.058682 | 0.037281 | 0.005765 | 0.035589 | 0.048565 | 0.079094 | 0.013692 | 0.102770 | 0.078375 | 0.005415 | 0.057058 | 0.070860 | 0.013402 | 0.035829 | 0.072818 | 0.000190 | -0.021229 | 0.000239 | -0.023767 | -0.073701 | 0.008905 | 0.096365 | 0.000630 | -0.011756 | -0.046717 | -0.004040 | 0.082819 | 0.022602 | -0.003100 | 0.028986 | 0.003857 | 0.052429 | 0.048828 | 0.032252 | 0.061925 | 0.011743 | 0.034329 | 0.021082 | 0.031023 | -0.016148 | -0.003906 | 0.004238 | -0.001275 | 0.054451 | 0.015925 | -0.048448 |
Next, we extract the pairs of variables that have a very strong linear relationship, i.e. those whose absolute correlation exceeds 0.9.
def correl_09(df):
    # Correlation matrix of the numeric columns of the dataframe passed in
    df_correlation = df.select_dtypes(include="number").corr()
    # Keep only coefficients whose absolute value exceeds 0.9; the rest become NaN
    df_thr_09 = df_correlation[np.abs(df_correlation) > 0.9]
    couples = []
    cor_cofs = []
    # Collect every off-diagonal pair that passed the filter
    for i in range(df_thr_09.shape[0]):
        for j in range(df_thr_09.shape[1]):
            if i != j and not np.isnan(df_thr_09.iloc[i, j]):
                couples.append((df_thr_09.index[i], df_thr_09.columns[j]))
                cor_cofs.append(df_thr_09.iloc[i, j])
    # Build a table of the pairs and their coefficients
    correl_table = pd.DataFrame(list(zip(couples, cor_cofs)),
                                columns=['couple', 'correlation'])
    return correl_table.sort_values(by=['correlation'])
correl_09(application_train)
| | couple | correlation |
|---|---|---|
| 2 | (DAYS_EMPLOYED, FLAG_EMP_PHONE) | -0.999755 |
| 3 | (FLAG_EMP_PHONE, DAYS_EMPLOYED) | -0.999755 |
| 32 | (LIVINGAPARTMENTS_AVG, APARTMENTS_MODE) | 0.908278 |
| 46 | (APARTMENTS_MODE, LIVINGAPARTMENTS_AVG) | 0.908278 |
| 75 | (LIVINGAREA_MODE, APARTMENTS_MODE) | 0.910376 |
| ... | ... | ... |
| 102 | (FLOORSMIN_MEDI, FLOORSMIN_AVG) | 0.997241 |
| 122 | (OBS_30_CNT_SOCIAL_CIRCLE, OBS_60_CNT_SOCIAL_CIRCLE) | 0.998490 |
| 123 | (OBS_60_CNT_SOCIAL_CIRCLE, OBS_30_CNT_SOCIAL_CIRCLE) | 0.998490 |
| 18 | (YEARS_BUILD_AVG, YEARS_BUILD_MEDI) | 0.998495 |
| 92 | (YEARS_BUILD_MEDI, YEARS_BUILD_AVG) | 0.998495 |
124 rows × 2 columns
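The table above lists every pair twice, once in each order, because the correlation matrix is symmetric. One possible way to list each pair only once is to keep just the upper triangle of the matrix before filtering. The sketch below illustrates this on toy data; the function name `strong_pairs` and the toy columns are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def strong_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return each column pair with |correlation| > threshold exactly once."""
    corr = df.select_dtypes(include="number").corr()
    # Boolean mask that is True strictly above the diagonal
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # Entries outside the upper triangle become NaN and fail the filter below
    pairs = corr.where(upper).stack()
    return pairs[pairs.abs() > threshold].sort_values()

# Toy data (hypothetical): y is an exact linear function of x, z is unrelated
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [3, 5, 7, 9, 11],
    "z": [5, 1, 4, 2, 3],
})
result = strong_pairs(toy)
print(result)
```

On the real data, `strong_pairs(application_train)` would be expected to return half as many rows as the table above, since each pair is reported once.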
Because the variables in each of these pairs are very highly correlated with each other, we keep only one variable from each pair. We will therefore drop the following columns in the cleaning part:
"AMT_GOODS_PRICE", "FLAG_EMP_PHONE", "REGION_RATING_CLIENT", "APARTMENTS_MODE", "LIVINGAPARTMENTS_MODE",
"APARTMENTS_MEDI", "LIVINGAPARTMENTS_MEDI", "LIVINGAREA_MEDI", "BASEMENTAREA_MODE", "BASEMENTAREA_MEDI",
"YEARS_BEGINEXPLUATATION_MODE", "YEARS_BEGINEXPLUATATION_MEDI", "YEARS_BUILD_MODE", "YEARS_BUILD_MEDI",
"COMMONAREA_MODE", "COMMONAREA_MEDI", "ELEVATORS_MODE", "ELEVATORS_MEDI", "ENTRANCES_MODE", "ENTRANCES_MEDI",
"FLOORSMAX_MODE", "FLOORSMAX_MEDI", "FLOORSMIN_MODE", "FLOORSMIN_MEDI", "LANDAREA_MODE", "LANDAREA_MEDI",
"LIVINGAREA_MODE", "TOTALAREA_MODE", "NONLIVINGAPARTMENTS_MODE", "NONLIVINGAPARTMENTS_MEDI", "OBS_30_CNT_SOCIAL_CIRCLE",
"NONLIVINGAREA_MEDI", "NONLIVINGAREA_MODE".
Some columns are dominated by a single value that accounts for more than 95% of the observations.
application_train["FLAG_MOBIL"].value_counts(normalize = True)
1    0.999997
0    0.000003
Name: FLAG_MOBIL, dtype: float64
application_train["FLAG_CONT_MOBILE"].value_counts(normalize = True)
1    0.998133
0    0.001867
Name: FLAG_CONT_MOBILE, dtype: float64
application_train["REG_REGION_NOT_LIVE_REGION"].value_counts(normalize = True)
0    0.984856
1    0.015144
Name: REG_REGION_NOT_LIVE_REGION, dtype: float64
for col in ['FLAG_DOCUMENT_2',
'FLAG_DOCUMENT_3',
'FLAG_DOCUMENT_4',
'FLAG_DOCUMENT_5',
'FLAG_DOCUMENT_6',
'FLAG_DOCUMENT_7',
'FLAG_DOCUMENT_8',
'FLAG_DOCUMENT_9',
'FLAG_DOCUMENT_10',
'FLAG_DOCUMENT_11',
'FLAG_DOCUMENT_12',
'FLAG_DOCUMENT_13',
'FLAG_DOCUMENT_14',
'FLAG_DOCUMENT_15',
'FLAG_DOCUMENT_16',
'FLAG_DOCUMENT_17',
'FLAG_DOCUMENT_18',
'FLAG_DOCUMENT_19',
'FLAG_DOCUMENT_20',
'FLAG_DOCUMENT_21']:
    print(application_train[col].value_counts(normalize = True))
0    0.999958
1    0.000042
Name: FLAG_DOCUMENT_2, dtype: float64
1    0.710023
0    0.289977
Name: FLAG_DOCUMENT_3, dtype: float64
0    0.999919
1    0.000081
Name: FLAG_DOCUMENT_4, dtype: float64
0    0.984885
1    0.015115
Name: FLAG_DOCUMENT_5, dtype: float64
0    0.911945
1    0.088055
Name: FLAG_DOCUMENT_6, dtype: float64
0    0.999808
1    0.000192
Name: FLAG_DOCUMENT_7, dtype: float64
0    0.918624
1    0.081376
Name: FLAG_DOCUMENT_8, dtype: float64
0    0.996104
1    0.003896
Name: FLAG_DOCUMENT_9, dtype: float64
0    0.999977
1    0.000023
Name: FLAG_DOCUMENT_10, dtype: float64
0    0.996088
1    0.003912
Name: FLAG_DOCUMENT_11, dtype: float64
0    0.999993
1    0.000007
Name: FLAG_DOCUMENT_12, dtype: float64
0    0.996475
1    0.003525
Name: FLAG_DOCUMENT_13, dtype: float64
0    0.997064
1    0.002936
Name: FLAG_DOCUMENT_14, dtype: float64
0    0.99879
1    0.00121
Name: FLAG_DOCUMENT_15, dtype: float64
0    0.990072
1    0.009928
Name: FLAG_DOCUMENT_16, dtype: float64
0    0.999733
1    0.000267
Name: FLAG_DOCUMENT_17, dtype: float64
0    0.99187
1    0.00813
Name: FLAG_DOCUMENT_18, dtype: float64
0    0.999405
1    0.000595
Name: FLAG_DOCUMENT_19, dtype: float64
0    0.999493
1    0.000507
Name: FLAG_DOCUMENT_20, dtype: float64
0    0.999665
1    0.000335
Name: FLAG_DOCUMENT_21, dtype: float64
Because these variables have very low variance (almost all observations share a single value), we will drop the following columns in the cleaning part:
'FLAG_MOBIL', 'FLAG_CONT_MOBILE', 'REG_REGION_NOT_LIVE_REGION', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_7'
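The manual inspection above can also be automated by computing, for every column, the share of rows taken by its most frequent value and comparing it with the 95% threshold. The following is a minimal sketch on toy data; the helper name `dominant_value_share` and the toy columns are hypothetical.

```python
import pandas as pd

def dominant_value_share(df: pd.DataFrame) -> pd.Series:
    """Share of rows taken by the most frequent value of each column."""
    shares = df.apply(lambda s: s.value_counts(normalize=True).max())
    return shares.sort_values(ascending=False)

# Toy data (hypothetical): flag_a is almost constant, flag_b is balanced
toy = pd.DataFrame({
    "flag_a": [1] * 99 + [0],
    "flag_b": [0, 1] * 50,
})
shares = dominant_value_share(toy)
# Columns whose dominant value covers more than 95% of the rows
low_variance_columns = shares[shares > 0.95].index.tolist()
print(low_variance_columns)
```

Applied to `application_train`, this kind of check would also flag any dominated columns that were not inspected by hand.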
Because DAYS_EMPLOYED has a lot of outliers (its maximum of 365243 days, roughly 1,000 years, is clearly implausible), we will drop this column in a later part.
The 'bureau.csv' file contains all of the clients' previous credits provided by other financial institutions that were reported to the Credit Bureau (for clients who have a loan in our sample). For every loan in our sample, there are as many rows as the number of credits the client had in the Credit Bureau before the application date.
# size of dataset
bureau.shape
(1716428, 17)
bureau.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
 #   Column                  Dtype
---  ------                  -----
 0   SK_ID_CURR              int64
 1   SK_ID_BUREAU            int64
 2   CREDIT_ACTIVE           object
 3   CREDIT_CURRENCY         object
 4   DAYS_CREDIT             int64
 5   CREDIT_DAY_OVERDUE      int64
 6   DAYS_CREDIT_ENDDATE     float64
 7   DAYS_ENDDATE_FACT       float64
 8   AMT_CREDIT_MAX_OVERDUE  float64
 9   CNT_CREDIT_PROLONG      int64
 10  AMT_CREDIT_SUM          float64
 11  AMT_CREDIT_SUM_DEBT     float64
 12  AMT_CREDIT_SUM_LIMIT    float64
 13  AMT_CREDIT_SUM_OVERDUE  float64
 14  CREDIT_TYPE             object
 15  DAYS_CREDIT_UPDATE      int64
 16  AMT_ANNUITY             float64
dtypes: float64(8), int64(6), object(3)
memory usage: 222.6+ MB
Next we can look at the number and percentage of missing values in each column.
# Missing values statistics
missing_values_bureau = missing_values_table(bureau)
missing_values_bureau
Your selected dataframe has 17 columns. There are 7 columns that have missing values.
| | Missing Values | % of Total Values |
|---|---|---|
| AMT_ANNUITY | 1226791 | 71.5 |
| AMT_CREDIT_MAX_OVERDUE | 1124488 | 65.5 |
| DAYS_ENDDATE_FACT | 633653 | 36.9 |
| AMT_CREDIT_SUM_LIMIT | 591780 | 34.5 |
| AMT_CREDIT_SUM_DEBT | 257669 | 15.0 |
| DAYS_CREDIT_ENDDATE | 105553 | 6.1 |
| AMT_CREDIT_SUM | 13 | 0.0 |
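The `missing_values_table` helper is defined elsewhere in the notebook and is not shown in this part. Judging only from its printed message and output columns, it may look roughly like the sketch below. This is a hypothetical reconstruction, not the authors' actual code, and it is named `missing_values_table_sketch` to avoid shadowing the real helper.

```python
import pandas as pd

def missing_values_table_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical reconstruction: count and percentage of missing values per column."""
    missing = df.isnull().sum()
    percent = 100 * missing / len(df)
    table = pd.concat([missing, percent], axis=1)
    table.columns = ["Missing Values", "% of Total Values"]
    # Keep only columns that actually have missing values, largest share first
    table = table[table["Missing Values"] > 0]
    table = table.sort_values("% of Total Values", ascending=False).round(1)
    print(f"Your selected dataframe has {df.shape[1]} columns. "
          f"There are {table.shape[0]} columns that have missing values.")
    return table

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": [1, 2, 3, 4],
    "c": [None, 2.0, 3.0, 4.0],
})
missing_toy = missing_values_table_sketch(toy)
print(missing_toy)
```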
We describe some basic statistics such as count, mean, standard deviation, quartiles, min, and max.
# Numeric columns
bureau.describe()
| | SK_ID_CURR | SK_ID_BUREAU | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | DAYS_CREDIT_UPDATE | AMT_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.716428e+06 | 1.716428e+06 | 1.716428e+06 | 1.716428e+06 | 1.610875e+06 | 1.082775e+06 | 5.919400e+05 | 1.716428e+06 | 1.716415e+06 | 1.458759e+06 | 1.124648e+06 | 1.716428e+06 | 1.716428e+06 | 4.896370e+05 |
| mean | 2.782149e+05 | 5.924434e+06 | -1.142108e+03 | 8.181666e-01 | 5.105174e+02 | -1.017437e+03 | 3.825418e+03 | 6.410406e-03 | 3.549946e+05 | 1.370851e+05 | 6.229515e+03 | 3.791276e+01 | -5.937483e+02 | 1.571276e+04 |
| std | 1.029386e+05 | 5.322657e+05 | 7.951649e+02 | 3.654443e+01 | 4.994220e+03 | 7.140106e+02 | 2.060316e+05 | 9.622391e-02 | 1.149811e+06 | 6.774011e+05 | 4.503203e+04 | 5.937650e+03 | 7.207473e+02 | 3.258269e+05 |
| min | 1.000010e+05 | 5.000000e+06 | -2.922000e+03 | 0.000000e+00 | -4.206000e+04 | -4.202300e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -4.705600e+06 | -5.864061e+05 | 0.000000e+00 | -4.194700e+04 | 0.000000e+00 |
| 25% | 1.888668e+05 | 5.463954e+06 | -1.666000e+03 | 0.000000e+00 | -1.138000e+03 | -1.489000e+03 | 0.000000e+00 | 0.000000e+00 | 5.130000e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -9.080000e+02 | 0.000000e+00 |
| 50% | 2.780550e+05 | 5.926304e+06 | -9.870000e+02 | 0.000000e+00 | -3.300000e+02 | -8.970000e+02 | 0.000000e+00 | 0.000000e+00 | 1.255185e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -3.950000e+02 | 0.000000e+00 |
| 75% | 3.674260e+05 | 6.385681e+06 | -4.740000e+02 | 0.000000e+00 | 4.740000e+02 | -4.250000e+02 | 0.000000e+00 | 0.000000e+00 | 3.150000e+05 | 4.015350e+04 | 0.000000e+00 | 0.000000e+00 | -3.300000e+01 | 1.350000e+04 |
| max | 4.562550e+05 | 6.843457e+06 | 0.000000e+00 | 2.792000e+03 | 3.119900e+04 | 0.000000e+00 | 1.159872e+08 | 9.000000e+00 | 5.850000e+08 | 1.701000e+08 | 4.705600e+06 | 3.756681e+06 | 3.720000e+02 | 1.184534e+08 |
# Object/categorical columns
bureau.describe(include=["object", "category"])
| | CREDIT_ACTIVE | CREDIT_CURRENCY | CREDIT_TYPE |
|---|---|---|---|
| count | 1716428 | 1716428 | 1716428 |
| unique | 4 | 4 | 15 |
| top | Closed | currency 1 | Consumer credit |
| freq | 1079273 | 1715020 | 1251615 |
The correlation coefficient is a statistical measure of the strength of the linear relationship between two variables. Its value ranges from -1 to 1. Some general interpretations of the absolute value of the correlation coefficient are:
.00-.19 “very weak”
.20-.39 “weak”
.40-.59 “moderate”
.60-.79 “strong”
.80-1.0 “very strong”
plt.figure(figsize=(12,8))
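# Restrict to numeric columns: recent pandas versions raise an error on object columns
sns.heatmap(bureau.select_dtypes(include="number").corr(), annot = True)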
plt.show()
The highest correlation is 0.88, between DAYS_CREDIT and DAYS_ENDDATE_FACT, which indicates a very strong linear relationship on the scale above.
The 'bureau_balance.csv' file contains the monthly balances of previous credits in the Credit Bureau. This table has one row for each month of history of every previous credit reported to the Credit Bureau, i.e. the table has (#loans in sample * # of relative previous credits * # of months where we have some history observable for the previous credits) rows.
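To make the row structure concrete, the sketch below builds a toy table with the same three columns as bureau_balance and counts the months of history recorded for each previous credit. The values are hypothetical and only illustrate the layout: one row per credit per month.

```python
import pandas as pd

# Toy stand-in for bureau_balance (hypothetical values, real column names)
toy_balance = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1],
    "STATUS": ["C", "C", "0", "X", "0"],
})

# Number of monthly records available for each previous credit
months_per_credit = toy_balance.groupby("SK_ID_BUREAU").size()
print(months_per_credit)

# The total row count is the sum of the per-credit month counts
total_rows = months_per_credit.sum()
```

The same `groupby("SK_ID_BUREAU").size()` call on the real dataframe would show how many months of history each credit has.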
# size of dataset
bureau_balance.shape
(27299925, 3)
bureau_balance.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
 #   Column          Dtype
---  ------          -----
 0   SK_ID_BUREAU    int64
 1   MONTHS_BALANCE  int64
 2   STATUS          object
dtypes: int64(2), object(1)
memory usage: 624.8+ MB
Next we can look at the number and percentage of missing values in each column.
# Missing values statistics
missing_values_bureau_balance = missing_values_table(bureau_balance)
missing_values_bureau_balance
Your selected dataframe has 3 columns. There are 0 columns that have missing values.
| | Missing Values | % of Total Values |
|---|---|---|
A frequency distribution shows how often each distinct value occurs in a set of data. A histogram is the most commonly used graph for showing frequency distributions.
bureau_balance.hist()
plt.show()
Using box plots, we can easily spot the outliers in each column of the dataset.
box_plot(bureau_balance)
We can clearly see that there are no outliers in the bureau_balance dataframe.
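A numeric check can back up what the box plots show. The sketch below counts values beyond 1.5 × IQR from the quartiles, which is the default whisker rule of matplotlib box plots. The helper name `count_iqr_outliers` and the toy data are assumptions; the settings of the notebook's own `box_plot` helper are not shown in this part.

```python
import pandas as pd


def count_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the dataframe columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Toy data: evenly spread months, and a count column with one extreme value
toy = pd.DataFrame({
    "MONTHS_BALANCE": [-5, -4, -3, -2, -1, 0],
    "SK_DPD": [0, 0, 0, 0, 0, 300],
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

Note that a column dominated by a single value has an IQR of zero, so every other value is flagged as an outlier under this rule.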
Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.
This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample, i.e. the table has (# loans in sample × # of relative previous credit cards × # of months for which some history is observable for the previous credit card) rows.
# SIZE OF DATA
credit_card_balance.shape
(3840312, 23)
# DATATYPE OF EACH COLUMN
credit_card_balance.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Data columns (total 23 columns):
 #   Column                      Dtype
---  ------                      -----
 0   SK_ID_PREV                  int64
 1   SK_ID_CURR                  int64
 2   MONTHS_BALANCE              int64
 3   AMT_BALANCE                 float64
 4   AMT_CREDIT_LIMIT_ACTUAL     int64
 5   AMT_DRAWINGS_ATM_CURRENT    float64
 6   AMT_DRAWINGS_CURRENT        float64
 7   AMT_DRAWINGS_OTHER_CURRENT  float64
 8   AMT_DRAWINGS_POS_CURRENT    float64
 9   AMT_INST_MIN_REGULARITY     float64
 10  AMT_PAYMENT_CURRENT         float64
 11  AMT_PAYMENT_TOTAL_CURRENT   float64
 12  AMT_RECEIVABLE_PRINCIPAL    float64
 13  AMT_RECIVABLE               float64
 14  AMT_TOTAL_RECEIVABLE        float64
 15  CNT_DRAWINGS_ATM_CURRENT    float64
 16  CNT_DRAWINGS_CURRENT        int64
 17  CNT_DRAWINGS_OTHER_CURRENT  float64
 18  CNT_DRAWINGS_POS_CURRENT    float64
 19  CNT_INSTALMENT_MATURE_CUM   float64
 20  NAME_CONTRACT_STATUS        object
 21  SK_DPD                      int64
 22  SK_DPD_DEF                  int64
dtypes: float64(15), int64(7), object(1)
memory usage: 673.9+ MB
Next we can look at the number and percentage of missing values in each column.
# Missing values statistics
missing_values_credit_card_balance = missing_values_table(credit_card_balance)
missing_values_credit_card_balance
Your selected dataframe has 23 columns. There are 9 columns that have missing values.
| | Missing Values | % of Total Values |
|---|---|---|
| AMT_PAYMENT_CURRENT | 767988 | 20.0 |
| AMT_DRAWINGS_ATM_CURRENT | 749816 | 19.5 |
| AMT_DRAWINGS_OTHER_CURRENT | 749816 | 19.5 |
| AMT_DRAWINGS_POS_CURRENT | 749816 | 19.5 |
| CNT_DRAWINGS_ATM_CURRENT | 749816 | 19.5 |
| CNT_DRAWINGS_OTHER_CURRENT | 749816 | 19.5 |
| CNT_DRAWINGS_POS_CURRENT | 749816 | 19.5 |
| AMT_INST_MIN_REGULARITY | 305236 | 7.9 |
| CNT_INSTALMENT_MATURE_CUM | 305236 | 7.9 |
Next we can look at the unique values of the categorical column NAME_CONTRACT_STATUS.
# CONTRACT STATUS
credit_card_balance['NAME_CONTRACT_STATUS'].unique()
array(['Active', 'Completed', 'Demand', 'Signed', 'Sent proposal',
'Refused', 'Approved'], dtype=object)
Next we describe some basic statistics such as count, mean, standard deviation, quartiles, min and max.
# Numeric columns
credit_card_balance.describe()
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | AMT_PAYMENT_CURRENT | AMT_PAYMENT_TOTAL_CURRENT | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.090496e+06 | 3.840312e+06 | 3.090496e+06 | 3.090496e+06 | 3.535076e+06 | 3.072324e+06 | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.090496e+06 | 3.840312e+06 | 3.090496e+06 | 3.090496e+06 | 3.535076e+06 | 3.840312e+06 | 3.840312e+06 |
| mean | 1.904504e+06 | 2.783242e+05 | -3.452192e+01 | 5.830016e+04 | 1.538080e+05 | 5.961325e+03 | 7.433388e+03 | 2.881696e+02 | 2.968805e+03 | 3.540204e+03 | 1.028054e+04 | 7.588857e+03 | 5.596588e+04 | 5.808881e+04 | 5.809829e+04 | 3.094490e-01 | 7.031439e-01 | 4.812496e-03 | 5.594791e-01 | 2.082508e+01 | 9.283667e+00 | 3.316220e-01 |
| std | 5.364695e+05 | 1.027045e+05 | 2.666775e+01 | 1.063070e+05 | 1.651457e+05 | 2.822569e+04 | 3.384608e+04 | 8.201989e+03 | 2.079689e+04 | 5.600154e+03 | 3.607808e+04 | 3.200599e+04 | 1.025336e+05 | 1.059654e+05 | 1.059718e+05 | 1.100401e+00 | 3.190347e+00 | 8.263861e-02 | 3.240649e+00 | 2.005149e+01 | 9.751570e+01 | 2.147923e+01 |
| min | 1.000018e+06 | 1.000060e+05 | -9.600000e+01 | -4.202502e+05 | 0.000000e+00 | -6.827310e+03 | -6.211620e+03 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -4.233058e+05 | -4.202502e+05 | -4.202502e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 25% | 1.434385e+06 | 1.895170e+05 | -5.500000e+01 | 0.000000e+00 | 4.500000e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.523700e+02 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 50% | 1.897122e+06 | 2.783960e+05 | -2.800000e+01 | 0.000000e+00 | 1.125000e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.702700e+03 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.500000e+01 | 0.000000e+00 | 0.000000e+00 |
| 75% | 2.369328e+06 | 3.675800e+05 | -1.100000e+01 | 8.904669e+04 | 1.800000e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 6.633911e+03 | 9.000000e+03 | 6.750000e+03 | 8.535924e+04 | 8.889949e+04 | 8.891451e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.200000e+01 | 0.000000e+00 | 0.000000e+00 |
| max | 2.843496e+06 | 4.562500e+05 | -1.000000e+00 | 1.505902e+06 | 1.350000e+06 | 2.115000e+06 | 2.287098e+06 | 1.529847e+06 | 2.239274e+06 | 2.028820e+05 | 4.289207e+06 | 4.278316e+06 | 1.472317e+06 | 1.493338e+06 | 1.493338e+06 | 5.100000e+01 | 1.650000e+02 | 1.200000e+01 | 1.650000e+02 | 1.200000e+02 | 3.260000e+03 | 3.260000e+03 |
# Object/categorical columns
credit_card_balance.describe(include=["object", "category"])
| | NAME_CONTRACT_STATUS |
|---|---|
| count | 3840312 |
| unique | 7 |
| top | Active |
| freq | 3698436 |
The correlation coefficient is a statistical measure of the strength of a linear relationship between two variables. Its values range from -1 to 1. Some general interpretations of the absolute value of the correlation coefficient are:
.00-.19 “very weak”
.20-.39 “weak”
.40-.59 “moderate”
.60-.79 “strong”
.80-1.0 “very strong”
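plt.figure(figsize=(12,8))
# Compute the correlation matrix once, on the numeric columns only
corr = credit_card_balance.select_dtypes(include="number").corr()
# Mask the upper triangle (diagonal included) so each pair is shown only once
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, annot=True, mask=mask)
plt.show()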
correl_09(credit_card_balance)
| | couple | correlation |
|---|---|---|
| 2 | (DAYS_EMPLOYED, FLAG_EMP_PHONE) | -0.999755 |
| 3 | (FLAG_EMP_PHONE, DAYS_EMPLOYED) | -0.999755 |
| 32 | (LIVINGAPARTMENTS_AVG, APARTMENTS_MODE) | 0.908278 |
| 46 | (APARTMENTS_MODE, LIVINGAPARTMENTS_AVG) | 0.908278 |
| 75 | (LIVINGAREA_MODE, APARTMENTS_MODE) | 0.910376 |
| ... | ... | ... |
| 102 | (FLOORSMIN_MEDI, FLOORSMIN_AVG) | 0.997241 |
| 122 | (OBS_30_CNT_SOCIAL_CIRCLE, OBS_60_CNT_SOCIAL_CIRCLE) | 0.998490 |
| 123 | (OBS_60_CNT_SOCIAL_CIRCLE, OBS_30_CNT_SOCIAL_CIRCLE) | 0.998490 |
| 18 | (YEARS_BUILD_AVG, YEARS_BUILD_MEDI) | 0.998495 |
| 92 | (YEARS_BUILD_MEDI, YEARS_BUILD_AVG) | 0.998495 |
124 rows × 2 columns
These couple have correlation > 0.9:
[('AMT_PAYMENT_CURRENT', 'AMT_PAYMENT_TOTAL_CURRENT'), ('AMT_BALANCE', 'AMT_RECEIVABLE_PRINCIPAL'), ('AMT_BALANCE', 'AMT_RECIVABLE'), ('AMT_RECEIVABLE_PRINCIPAL', 'AMT_RECIVABLE'), ('AMT_BALANCE', 'AMT_TOTAL_RECEIVABLE'), ('AMT_RECEIVABLE_PRINCIPAL', 'AMT_TOTAL_RECEIVABLE'), ('AMT_RECIVABLE', 'AMT_TOTAL_RECEIVABLE'), ('CNT_DRAWINGS_CURRENT', 'CNT_DRAWINGS_POS_CURRENT')]
Some columns are nearly redundant with others, so we will drop these columns in a later part.
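The definition of `correl_09` is not shown in this part, so the following is only a sketch of one way to list highly correlated pairs. The name `high_corr_pairs`, its output column names and the toy data are assumptions. It reads the strict upper triangle of the correlation matrix so that each pair appears once instead of twice.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Return each column pair once, keeping those with |correlation| > threshold."""
    corr = df.select_dtypes(include="number").corr()
    # Keep only the strict upper triangle; everything else becomes NaN
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().reset_index()
    pairs.columns = ["feature_1", "feature_2", "correlation"]
    # NaN entries never pass the comparison, so they are filtered out here
    return pairs[pairs["correlation"].abs() > threshold].reset_index(drop=True)


# Toy data: two nearly identical amount columns and one unrelated column
toy = pd.DataFrame({
    "AMT_BALANCE": [0.0, 10.0, 20.0, 30.0, 40.0],
    "AMT_RECIVABLE": [0.0, 11.0, 19.0, 31.0, 40.0],
    "SK_DPD": [3.0, 0.0, 5.0, 0.0, 1.0],
})
pairs = high_corr_pairs(toy)
print(pairs)
```

Dropping `feature_2` of every returned pair is one simple way to keep a single representative of each group, although which member to keep is a modelling decision.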
The "installments_payments.csv" represents repayment history for the previously disbursed credits in Home Credit related to the loans in the sample.
There is a) one row for every payment that was made plus b) one row each for missed payment.
One row is equivalent to one payment of one installment OR one installment corresponding to one payment of one previous Home Credit credit related to loans in our sample.
# SIZE OF THE DATA
print('\nSize of installments_payments data:', installments_payments.shape)
Size of installments_payments data: (13605401, 8)
# DATATYPE OF EACH COLUMN
installments_payments.dtypes
SK_ID_PREV                  int64
SK_ID_CURR                  int64
NUM_INSTALMENT_VERSION    float64
NUM_INSTALMENT_NUMBER       int64
DAYS_INSTALMENT           float64
DAYS_ENTRY_PAYMENT        float64
AMT_INSTALMENT            float64
AMT_PAYMENT               float64
dtype: object
Next we can look at the number and percentage of missing values in each column.
missing_values_installments_payments = missing_values_table(installments_payments)
missing_values_installments_payments
Two columns, the day and the amount the customer actually paid (DAYS_ENTRY_PAYMENT and AMT_PAYMENT), contain the same number of null values.
A bit more than 0.02% of the installment records have no recorded payment, i.e. the client has not paid that installment of the previous credit.
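Equal null counts do not by themselves prove that the same rows are affected, so it is worth checking both facts directly. A minimal sketch on toy data (the values are made up; only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data: one installment out of four has no recorded payment
toy = pd.DataFrame({
    "DAYS_ENTRY_PAYMENT": [-10.0, np.nan, -40.0, -70.0],
    "AMT_PAYMENT": [500.0, np.nan, 500.0, 250.0],
})

# Share of rows with no recorded payment amount, in percent
missing_pct = toy["AMT_PAYMENT"].isna().mean() * 100

# True only when both columns are missing on exactly the same rows
same_rows = toy["AMT_PAYMENT"].isna().equals(toy["DAYS_ENTRY_PAYMENT"].isna())

print(missing_pct, same_rows)
```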
Next we describe some basic statistics such as count, mean, standard deviation, quartiles, min and max.
# Numeric columns
installments_payments.describe()
| | SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT |
|---|---|---|---|---|---|---|---|---|
| count | 1.360540e+07 | 1.360540e+07 | 1.360540e+07 | 1.360540e+07 | 1.360540e+07 | 1.360250e+07 | 1.360540e+07 | 1.360250e+07 |
| mean | 1.903365e+06 | 2.784449e+05 | 8.566373e-01 | 1.887090e+01 | -1.042270e+03 | -1.051114e+03 | 1.705091e+04 | 1.723822e+04 |
| std | 5.362029e+05 | 1.027183e+05 | 1.035216e+00 | 2.666407e+01 | 8.009463e+02 | 8.005859e+02 | 5.057025e+04 | 5.473578e+04 |
| min | 1.000001e+06 | 1.000010e+05 | 0.000000e+00 | 1.000000e+00 | -2.922000e+03 | -4.921000e+03 | 0.000000e+00 | 0.000000e+00 |
| 25% | 1.434191e+06 | 1.896390e+05 | 0.000000e+00 | 4.000000e+00 | -1.654000e+03 | -1.662000e+03 | 4.226085e+03 | 3.398265e+03 |
| 50% | 1.896520e+06 | 2.786850e+05 | 1.000000e+00 | 8.000000e+00 | -8.180000e+02 | -8.270000e+02 | 8.884080e+03 | 8.125515e+03 |
| 75% | 2.369094e+06 | 3.675300e+05 | 1.000000e+00 | 1.900000e+01 | -3.610000e+02 | -3.700000e+02 | 1.671021e+04 | 1.610842e+04 |
| max | 2.843499e+06 | 4.562550e+05 | 1.780000e+02 | 2.770000e+02 | -1.000000e+00 | -1.000000e+00 | 3.771488e+06 | 3.771488e+06 |
The average AMT_INSTALMENT (the prescribed amount of an installment) is smaller than the average AMT_PAYMENT (the amount the client actually paid on that installment). This may suggest that:
Some loans are not repaid on time
A frequency distribution shows how often each different value in a set of data occurs. A histogram is the most commonly used graph to show frequency distributions.
installments_payments.iloc[:,4:6].hist(bins=20)
The histograms of the prescribed installment day (DAYS_INSTALMENT) and the actual payment day (DAYS_ENTRY_PAYMENT) also suggest that some customers paid late.
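Averages and histograms only hint at late payment; comparing the two day columns row by row measures it directly. The sketch below uses toy values, and the derived column names `DAYS_LATE` and `AMT_SHORTFALL` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "DAYS_INSTALMENT": [-100.0, -70.0, -40.0, -10.0],
    "DAYS_ENTRY_PAYMENT": [-105.0, -70.0, -33.0, -10.0],
    "AMT_INSTALMENT": [500.0, 500.0, 500.0, 500.0],
    "AMT_PAYMENT": [500.0, 500.0, 500.0, 300.0],
})

# Days are counted relative to the current application, so a payment day that
# is larger (less negative) than the due day means the payment came late.
toy["DAYS_LATE"] = (toy["DAYS_ENTRY_PAYMENT"] - toy["DAYS_INSTALMENT"]).clip(lower=0)

# Amount still owed on the installment after the recorded payment
toy["AMT_SHORTFALL"] = (toy["AMT_INSTALMENT"] - toy["AMT_PAYMENT"]).clip(lower=0)

late_share = (toy["DAYS_LATE"] > 0).mean()
underpaid_share = (toy["AMT_SHORTFALL"] > 0).mean()
print(late_share, underpaid_share)
```

Rows with a missing payment produce NaN in both derived columns and are therefore not counted as late or underpaid here; they would need separate handling.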
The correlation coefficient is a statistical measure of the strength of a linear relationship between two variables. Its values range from -1 to 1. Some general interpretations of the absolute value of the correlation coefficient are:
.00-.19 “very weak”
.20-.39 “weak”
.40-.59 “moderate”
.60-.79 “strong”
.80-1.0 “very strong”
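plt.figure(figsize=(12,8))
# Compute the correlation matrix once, on the numeric columns only
corr = installments_payments.select_dtypes(include="number").corr()
# Mask the upper triangle (diagonal included) so each pair is shown only once
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, annot=True, mask=mask)
plt.show()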
We will not drop any columns.
SK_ID_PREV, SK_ID_CURR, NUM_INSTALMENT_VERSION and NUM_INSTALMENT_NUMBER form a composite key.
AMT_PAYMENT and AMT_INSTALMENT are highly correlated, but both are important variables, so we keep them for further analysis.
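Whether these four columns really identify a row uniquely is worth checking rather than assuming, for example because one installment might be settled by several payment entries. A minimal sketch of such a check on toy data (the values are made up):

```python
import pandas as pd

key_cols = ["SK_ID_PREV", "SK_ID_CURR", "NUM_INSTALMENT_VERSION", "NUM_INSTALMENT_NUMBER"]

# Toy data: installment number 2 of credit 1 is paid in two parts
toy = pd.DataFrame({
    "SK_ID_PREV": [1, 1, 1, 2],
    "SK_ID_CURR": [10, 10, 10, 20],
    "NUM_INSTALMENT_VERSION": [1.0, 1.0, 1.0, 0.0],
    "NUM_INSTALMENT_NUMBER": [1, 2, 2, 1],
    "AMT_PAYMENT": [500.0, 300.0, 200.0, 900.0],
})

# Number of rows whose key value occurs more than once
n_repeated = int(toy.duplicated(subset=key_cols, keep=False).sum())
print(n_repeated)
```

A result of zero on the real table would confirm that the key is unique; any other value means the rows must be aggregated before the key can be used for joins.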
Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit. This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample, i.e. the table has (# loans in sample × # of relative previous credits × # of months for which some history is observable for the previous credits) rows.
POS_CASH_balance.shape
(10001358, 8)
POS_CASH_balance.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
 #   Column                 Dtype
---  ------                 -----
 0   SK_ID_PREV             int64
 1   SK_ID_CURR             int64
 2   MONTHS_BALANCE         int64
 3   CNT_INSTALMENT         float64
 4   CNT_INSTALMENT_FUTURE  float64
 5   NAME_CONTRACT_STATUS   object
 6   SK_DPD                 int64
 7   SK_DPD_DEF             int64
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB
Next we can look at the number and percentage of missing values in each column.
missing_values_table(POS_CASH_balance)
Your selected dataframe has 8 columns. There are 2 columns that have missing values.
| | Missing Values | % of Total Values |
|---|---|---|
| CNT_INSTALMENT_FUTURE | 26087 | 0.3 |
| CNT_INSTALMENT | 26071 | 0.3 |
Next we can look at the number of unique values in each column.
unique_values_table(POS_CASH_balance).iloc[:, :1]
| | number of unique values |
|---|---|
| SK_ID_PREV | 936325 |
| SK_ID_CURR | 337252 |
| MONTHS_BALANCE | 96 |
| CNT_INSTALMENT | 73 |
| CNT_INSTALMENT_FUTURE | 79 |
| NAME_CONTRACT_STATUS | 9 |
| SK_DPD | 3400 |
| SK_DPD_DEF | 2307 |
We focus on the categorical column: Contract status
POS_CASH_balance["NAME_CONTRACT_STATUS"].unique()
array(['Active', 'Completed', 'Signed', 'Approved',
'Returned to the store', 'Demand', 'Canceled', 'XNA',
'Amortized debt'], dtype=object)
# Select all the columns of 'object' type to extract information from each feature
categorical_features_lst = POS_CASH_balance.select_dtypes(["object"]).columns.tolist()
for feature in categorical_features_lst:
    fig, ax = plt.subplots(figsize = (20, 10))
    # Few levels: vertical bars; many levels: horizontal bars so the labels stay readable
    if POS_CASH_balance[feature].nunique() < 10:
        sns.countplot(data = POS_CASH_balance, x = feature)
    else:
        sns.countplot(data = POS_CASH_balance, y = feature)
    ax.set_title("Count plot of each level of the feature: " + feature)
In this dataset, there is only one categorical feature, which indicates the contract status of the previous application. We can see that most monthly records are in the Active status.
Using box plots, we can easily spot the outliers in each column of the dataset.
box_plot(POS_CASH_balance)
We can see that SK_DPD, SK_DPD_DEF, CNT_INSTALMENT and CNT_INSTALMENT_FUTURE have outliers.
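plt.figure(figsize=(12,8))
# Compute the correlation matrix once, on the numeric columns only
corr = POS_CASH_balance.select_dtypes(include="number").corr()
# Mask the upper triangle (diagonal included) so each pair is shown only once
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, annot=True, mask=mask)
plt.show()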
Here, we can see that CNT_INSTALMENT and CNT_INSTALMENT_FUTURE are highly correlated (0.87).
Some columns are dominated by zeros, which make up more than 97% of their values.
# Show the percentage of zero values in the two columns 'SK_DPD' and 'SK_DPD_DEF'
percent_zero = (
    (POS_CASH_balance[['SK_DPD', 'SK_DPD_DEF']] == 0).mean() * 100
).to_frame('Percentage of zero values')
percent_zero
| | Percentage of zero values |
|---|---|
| SK_DPD | 97.048131 |
| SK_DPD_DEF | 98.860465 |
SK_DPD and SK_DPD_DEF consist almost entirely of zeros (more than 97% of rows), so we will drop these features later.
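The same check can be generalised to any column, numeric or categorical, by asking what share of rows the most frequent value takes. This is a sketch under assumptions: the helper name `dominant_value_share` and the toy data are not from the notebook.

```python
import pandas as pd


def dominant_value_share(df: pd.DataFrame) -> pd.Series:
    """Percentage of rows taken by the most frequent value of each column."""
    shares = {
        # value_counts sorts by frequency, so the first entry is the dominant value
        col: df[col].value_counts(normalize=True, dropna=False).iloc[0] * 100
        for col in df.columns
    }
    return pd.Series(shares).sort_values(ascending=False)


# Toy data: one column dominated by zeros, one with well spread values
toy = pd.DataFrame({
    "SK_DPD": [0, 0, 0, 0, 0, 0, 0, 0, 0, 7],
    "CNT_INSTALMENT": [12, 12, 24, 24, 36, 36, 48, 48, 6, 10],
})
shares = dominant_value_share(toy)
print(shares)
```

Columns near the top of this ranking carry little variation, which is the reasoning applied to SK_DPD and SK_DPD_DEF above.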
All previous applications for Home Credit loans of clients who have loans in our sample. There is one row for each previous application related to loans in our data sample.
previous_application.shape
(1670214, 37)
Next we can look at the number and percentage of missing values in each column.
missing_values_table(previous_application)
Your selected dataframe has 37 columns. There are 16 columns that have missing values.
| | Missing Values | % of Total Values |
|---|---|---|
| RATE_INTEREST_PRIMARY | 1664263 | 99.6 |
| RATE_INTEREST_PRIVILEGED | 1664263 | 99.6 |
| AMT_DOWN_PAYMENT | 895844 | 53.6 |
| RATE_DOWN_PAYMENT | 895844 | 53.6 |
| NAME_TYPE_SUITE | 820405 | 49.1 |
| DAYS_FIRST_DRAWING | 673065 | 40.3 |
| DAYS_FIRST_DUE | 673065 | 40.3 |
| DAYS_LAST_DUE_1ST_VERSION | 673065 | 40.3 |
| DAYS_LAST_DUE | 673065 | 40.3 |
| DAYS_TERMINATION | 673065 | 40.3 |
| NFLAG_INSURED_ON_APPROVAL | 673065 | 40.3 |
| AMT_GOODS_PRICE | 385515 | 23.1 |
| AMT_ANNUITY | 372235 | 22.3 |
| CNT_PAYMENT | 372230 | 22.3 |
| PRODUCT_COMBINATION | 346 | 0.0 |
| AMT_CREDIT | 1 | 0.0 |
missing_values_sr = previous_application.isnull().sum()
missing_values_df = missing_values_sr.loc[missing_values_sr > 0].sort_values(ascending = False).reset_index()
missing_values_df.columns = ["Feature", "Number of missing values"]
missing_values_df["Percentage of missing values"] = (missing_values_df["Number of missing values"] / previous_application.shape[0]) * 100
sns.barplot(x = missing_values_df["Percentage of missing values"], y = missing_values_df["Feature"])
plt.title("Percentage of missing values in the previous applications data")
# Select all the columns of 'object' type to extract information from each feature
categorical_features_lst = previous_application.select_dtypes(["object"]).columns.tolist()
for feature in categorical_features_lst:
    fig, ax = plt.subplots(figsize = (10, 5))
    # Plot the distribution of levels, most frequent level first
    if previous_application[feature].nunique() < 10:
        sns.countplot(data = previous_application, x = feature, order = previous_application[feature].value_counts().index.tolist())
    else:
        sns.countplot(data = previous_application, y = feature, order = previous_application[feature].value_counts().index.tolist())
    ax.set_title("Count plot of each level of the feature: " + feature)
We can get useful insights from the plots above:
for feature in ['DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE']:
    fig, ax = plt.subplots(1, 1, figsize = (10, 7))
    # Missing values are dropped because boxplot cannot handle NaN
    plt.boxplot(previous_application[feature].dropna(), patch_artist = True, vert = False)
    ax.set_title("Boxplot of the feature: " + feature)
These columns have too many outliers.
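plt.figure(figsize=(15,15))
# Compute the correlation matrix once, on the numeric columns only
corr = previous_application.select_dtypes(include="number").corr()
# Mask the upper triangle (diagonal included) so each pair is shown only once
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, annot=True, mask=mask, cmap="mako", alpha=0.5, fmt=".3f", square=True)
plt.show()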
Here, we can see several interesting things:
We will drop one feature from each pair of features whose absolute correlation is above 0.9.
Columns with a high percentage of missing values:
Columns that are highly correlated with each other; we will consider removing one variable from each pair.
Between:
In conclusion, in the next part we will remove these columns because of their high correlation:
Other reasons
# Show the percentage of 'Y' values in the column 'FLAG_LAST_APPL_PER_CONTRACT'
percent_Y = (previous_application['FLAG_LAST_APPL_PER_CONTRACT'] == 'Y').mean() * 100
print("Percentage of Y value:", percent_Y)
# Show the percentage of 'XAP' values in the column 'CODE_REJECT_REASON'
percent_XAP = (previous_application['CODE_REJECT_REASON'] == 'XAP').mean() * 100
print("Percentage of XAP value:", percent_XAP)
Percentage of Y value: 99.49257999274344
Percentage of XAP value: 81.01315160811728
previous_application["CODE_REJECT_REASON"].value_counts(normalize = True)
XAP       0.810132
HC        0.104915
LIMIT     0.033337
SCO       0.022432
CLIENT    0.015828
SCOFR     0.007670
XNA       0.003140
VERIF     0.002116
SYSTEM    0.000429
Name: CODE_REJECT_REASON, dtype: float64
# Show the percentage of XAP and XNA values in column 'NAME_CASH_LOAN_PURPOSE'
previous_application['NAME_CASH_LOAN_PURPOSE'].value_counts(normalize = True)
XAP                                 0.552421
XNA                                 0.405887
Repairs                             0.014229
Other                               0.009345
Urgent needs                        0.005036
Buying a used car                   0.001729
Building a house or an annex        0.001612
Everyday expenses                   0.001447
Medicine                            0.001302
Payments on other loans             0.001156
Education                           0.000942
Journey                             0.000742
Purchase of electronic equipment    0.000635
Buying a new car                    0.000606
Wedding / gift / holiday            0.000576
Buying a home                       0.000518
Car repairs                         0.000477
Furniture                           0.000448
Buying a holiday home / land        0.000319
Business development                0.000255
Gasification / water supply         0.000180
Buying a garage                     0.000081
Hobby                               0.000033
Money for a third person            0.000015
Refusal to name the goal            0.000009
Name: NAME_CASH_LOAN_PURPOSE, dtype: float64
Therefore, in the next part, we drop these features:
We drop some columns for the reasons stated in the previous part. We do this column cleaning first because it saves a lot of heavy work once the datasets are combined.
to_drop_application = ['AMT_GOODS_PRICE',
'FLAG_EMP_PHONE',
'REGION_RATING_CLIENT',
'APARTMENTS_MODE',
'LIVINGAPARTMENTS_MODE',
'APARTMENTS_MEDI',
'LIVINGAPARTMENTS_MEDI',
'LIVINGAREA_MEDI',
'BASEMENTAREA_MODE',
'BASEMENTAREA_MEDI',
'YEARS_BEGINEXPLUATATION_MODE',
'YEARS_BEGINEXPLUATATION_MEDI',
'YEARS_BUILD_MODE',
'YEARS_BUILD_MEDI',
'COMMONAREA_MODE',
'COMMONAREA_MEDI',
'ELEVATORS_MODE',
'ELEVATORS_MEDI',
'ENTRANCES_MODE',
'ENTRANCES_MEDI',
'FLOORSMAX_MODE',
'FLOORSMAX_MEDI',
'FLOORSMIN_MODE',
'FLOORSMIN_MEDI',
'LANDAREA_MODE',
'LANDAREA_MEDI',
'LIVINGAREA_MODE',
'TOTALAREA_MODE',
'NONLIVINGAPARTMENTS_MODE',
'NONLIVINGAPARTMENTS_MEDI',
'OBS_30_CNT_SOCIAL_CIRCLE',
'NONLIVINGAREA_MEDI',
'NONLIVINGAREA_MODE',
'FLAG_DOCUMENT_9',
'FLAG_DOCUMENT_10',
'FLAG_DOCUMENT_11',
'FLAG_DOCUMENT_12',
'FLAG_DOCUMENT_13',
'FLAG_DOCUMENT_14',
'FLAG_DOCUMENT_15',
'FLAG_DOCUMENT_16',
'FLAG_DOCUMENT_17',
'FLAG_DOCUMENT_18',
'FLAG_DOCUMENT_19',
'FLAG_DOCUMENT_20',
'FLAG_DOCUMENT_21',
'FLAG_MOBIL',
'FLAG_CONT_MOBILE',
'REG_REGION_NOT_LIVE_REGION',
'FLAG_DOCUMENT_2',
'FLAG_DOCUMENT_4',
'FLAG_DOCUMENT_5',
'FLAG_DOCUMENT_7',
'DAYS_EMPLOYED']
print(application_train.shape)
application_train.drop(columns = to_drop_application, inplace = True)
print(application_train.shape)
application_train.head(5)
(307511, 122)
(307511, 68)
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_8 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -3648.0 | -2120 | NaN | 0 | 1 | 0 | Laborers | 1.0 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | reg oper account | block of flats | Stone, brick | No | 2.0 | 2.0 | 2.0 | -1134.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1186.0 | -291 | NaN | 0 | 1 | 0 | Core staff | 2.0 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | School | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | reg oper account | block of flats | Block | No | 0.0 | 1.0 | 0.0 | -828.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -4260.0 | -2531 | 26.0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -9833.0 | -2437 | NaN | 0 | 0 | 0 | Laborers | 2.0 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 2.0 | 0.0 | -617.0 | 1 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -4311.0 | -3458 | NaN | 0 | 0 | 0 | Core staff | 1.0 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
print(application_test.shape)
application_test.drop(columns = to_drop_application, inplace = True)
print(application_test.shape)
application_test.head(5)
(48744, 121)
(48744, 67)
| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_8 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | Unaccompanied | Working | Higher education | Married | House / apartment | 0.018850 | -19241 | -5170.0 | -812 | NaN | 0 | 0 | 1 | NaN | 2.0 | 2 | TUESDAY | 18 | 0 | 0 | 0 | 0 | 0 | Kindergarten | 0.752614 | 0.789654 | 0.159520 | 0.0660 | 0.0590 | 0.9732 | NaN | NaN | NaN | 0.1379 | 0.125 | NaN | NaN | NaN | 0.0505 | NaN | NaN | NaN | block of flats | Stone, brick | No | 0.0 | 0.0 | 0.0 | -1740.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.035792 | -18064 | -9118.0 | -1623 | NaN | 0 | 0 | 0 | Low-skill Laborers | 2.0 | 2 | FRIDAY | 9 | 0 | 0 | 0 | 0 | 0 | Self-employed | 0.564990 | 0.291656 | 0.432962 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | NaN | Working | Higher education | Married | House / apartment | 0.019101 | -20038 | -2175.0 | -3503 | 5.0 | 0 | 0 | 0 | Drivers | 2.0 | 2 | MONDAY | 14 | 0 | 0 | 0 | 0 | 0 | Transport: type 3 | NaN | 0.699787 | 0.610991 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -856.0 | 0 | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.026392 | -13976 | -2000.0 | -4208 | NaN | 0 | 1 | 0 | Sales staff | 4.0 | 2 | WEDNESDAY | 11 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.525734 | 0.509677 | 0.612704 | 0.3052 | 0.1974 | 0.9970 | 0.9592 | 0.1165 | 0.32 | 0.2759 | 0.375 | 0.0417 | 0.2042 | 0.2404 | 0.3673 | 0.0386 | 0.08 | reg oper account | block of flats | Panel | No | 0.0 | 0.0 | 0.0 | -1805.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.010032 | -13040 | -4000.0 | -4262 | 16.0 | 1 | 0 | 0 | NaN | 3.0 | 2 | FRIDAY | 5 | 0 | 0 | 0 | 1 | 1 | Business Entity Type 3 | 0.202145 | 0.425687 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -821.0 | 1 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
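Because the same drop list is applied to both tables, the training and test sets should afterwards differ only by the TARGET column (68 versus 67 columns above). A minimal sketch of that sanity check on toy frames; the variable names `train`, `test` and `to_drop` are stand-ins for illustration, not the notebook's objects.

```python
import pandas as pd

# Toy stand-ins for the two application tables
train = pd.DataFrame({
    "SK_ID_CURR": [1, 2],
    "TARGET": [0, 1],
    "FLAG_MOBIL": [1, 1],
    "AMT_CREDIT": [1000.0, 2000.0],
})
test = pd.DataFrame({
    "SK_ID_CURR": [3],
    "FLAG_MOBIL": [1],
    "AMT_CREDIT": [1500.0],
})
to_drop = ["FLAG_MOBIL"]

# Non in-place drops return new frames and leave the originals untouched
train = train.drop(columns=to_drop)
test = test.drop(columns=to_drop)

# After dropping, the only difference should be the label column
only_in_train = set(train.columns) - set(test.columns)
only_in_test = set(test.columns) - set(train.columns)
print(only_in_train, only_in_test)
```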
In these tables, we do not have to drop any columns
# Columns to drop from credit_card_balance, for the reasons stated in the previous part
to_drop_credit_card_balance = [
    'AMT_PAYMENT_CURRENT',
    'AMT_RECEIVABLE_PRINCIPAL',
    'AMT_RECIVABLE',
    'AMT_BALANCE',
    'CNT_DRAWINGS_POS_CURRENT',
    'CNT_DRAWINGS_OTHER_CURRENT',
    'CNT_DRAWINGS_ATM_CURRENT',
    'AMT_DRAWINGS_POS_CURRENT',
    'AMT_DRAWINGS_OTHER_CURRENT',
    'AMT_DRAWINGS_ATM_CURRENT',
]
to_drop_credit_card_balance
['AMT_PAYMENT_CURRENT', 'AMT_RECEIVABLE_PRINCIPAL', 'AMT_RECIVABLE', 'AMT_BALANCE', 'CNT_DRAWINGS_POS_CURRENT', 'CNT_DRAWINGS_OTHER_CURRENT', 'CNT_DRAWINGS_ATM_CURRENT', 'AMT_DRAWINGS_POS_CURRENT', 'AMT_DRAWINGS_OTHER_CURRENT', 'AMT_DRAWINGS_ATM_CURRENT']
print("old shape: ", credit_card_balance.shape)
credit_card_balance.drop(columns = to_drop_credit_card_balance, inplace = True)
print("new shape: ", credit_card_balance.shape)
credit_card_balance.head(5)
old shape: (3840312, 23)
new shape: (3840312, 13)
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_CURRENT | AMT_INST_MIN_REGULARITY | AMT_PAYMENT_TOTAL_CURRENT | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | -6 | 135000 | 877.5 | 1700.325 | 1800.0 | 0.000 | 1 | 35.0 | Active | 0 | 0 |
| 1 | 2582071 | 363914 | -1 | 45000 | 2250.0 | 2250.000 | 2250.0 | 64875.555 | 1 | 69.0 | Active | 0 | 0 |
| 2 | 1740877 | 371185 | -7 | 450000 | 0.0 | 2250.000 | 2250.0 | 31460.085 | 0 | 30.0 | Active | 0 | 0 |
| 3 | 1389973 | 337855 | -4 | 225000 | 2250.0 | 11795.760 | 11925.0 | 233048.970 | 1 | 10.0 | Active | 0 | 0 |
| 4 | 1891521 | 126868 | -1 | 450000 | 11547.0 | 22924.890 | 27000.0 | 453919.455 | 1 | 101.0 | Active | 0 | 0 |
print("old shape: ", POS_CASH_balance.shape)
POS_CASH_balance.drop(['SK_DPD_DEF','SK_DPD'], inplace=True, axis = 1)
print("new shape: ", POS_CASH_balance.shape)
POS_CASH_balance.head(5)
old shape: (10001358, 8)
new shape: (10001358, 6)
| | SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS |
|---|---|---|---|---|---|---|
| 0 | 1803195 | 182943 | -31 | 48.0 | 45.0 | Active |
| 1 | 1715348 | 367990 | -33 | 36.0 | 35.0 | Active |
| 2 | 1784872 | 397406 | -32 | 12.0 | 9.0 | Active |
| 3 | 1903291 | 269225 | -35 | 48.0 | 42.0 | Active |
| 4 | 2341044 | 334279 | -35 | 36.0 | 35.0 | Active |
null_percent = (previous_application.isnull().sum()/previous_application.shape[0])*100
null_percent[null_percent > 90]
RATE_INTEREST_PRIMARY       99.643698
RATE_INTEREST_PRIVILEGED    99.643698
dtype: float64
previous_application.shape
(1670214, 37)
to_drop_previous_application = ["RATE_INTEREST_PRIMARY", "RATE_INTEREST_PRIVILEGED", 'DAYS_TERMINATION', 'AMT_GOODS_PRICE',
'AMT_APPLICATION','FLAG_LAST_APPL_PER_CONTRACT', 'CODE_REJECT_REASON', 'NAME_CASH_LOAN_PURPOSE',
'DAYS_FIRST_DRAWING',
'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
'DAYS_LAST_DUE']
print("old shape: ", previous_application.shape)
previous_application.drop(to_drop_previous_application, inplace=True, axis = 1)
print("new shape: ", previous_application.shape)
previous_application.head(5)
old shape: (1670214, 37)
new shape: (1670214, 25)
| | SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_CREDIT | AMT_DOWN_PAYMENT | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | NFLAG_INSURED_ON_APPROVAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 0.0 | SATURDAY | 15 | 1 | 0.0 | Approved | -73 | Cash through the bank | NaN | Repeater | Mobile | POS | XNA | Country-wide | 35 | Connectivity | 12.0 | middle | POS mobile with interest | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 679671.0 | NaN | THURSDAY | 11 | 1 | NaN | Approved | -164 | XNA | Unaccompanied | Repeater | XNA | Cash | x-sell | Contact center | -1 | XNA | 36.0 | low_action | Cash X-Sell: low | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 136444.5 | NaN | TUESDAY | 11 | 1 | NaN | Approved | -301 | Cash through the bank | Spouse, partner | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | high | Cash X-Sell: high | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 470790.0 | NaN | MONDAY | 7 | 1 | NaN | Approved | -512 | Cash through the bank | NaN | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | middle | Cash X-Sell: middle | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 404055.0 | NaN | THURSDAY | 9 | 1 | NaN | Refused | -781 | Cash through the bank | NaN | Repeater | XNA | Cash | walk-in | Credit and cash offices | -1 | XNA | 24.0 | high | Cash Street: high | NaN |
This is somewhat tricky because the tables have different formats and represent their data differently.
Our main dataset will be called 'data'; it is simply the train and test sets stacked together.
The other tables will be added after some processing.
data = train + test
This one is easy: both datasets have exactly the same format, the only difference being the TARGET column, which is present only in the train set.
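To make the stacking step concrete, here is a minimal sketch on made-up data. The frames `toy_train` and `toy_test` are hypothetical stand-ins for application_train and application_test, not the real files; the point is only that rows coming from the frame without TARGET end up with NaN in that column.

```python
import pandas as pd

# Hypothetical toy frames standing in for application_train / application_test
toy_train = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "TARGET": [0, 1, 0],
    "AMT_CREDIT": [100.0, 200.0, 300.0],
})
toy_test = pd.DataFrame({
    "SK_ID_CURR": [4, 5],
    "AMT_CREDIT": [400.0, 500.0],
})

# Stack the rows; the union of columns is kept and TARGET is NaN for test rows
toy_data = pd.concat([toy_train, toy_test])

# Counting missing TARGET values should recover the number of test rows
n_missing_target = int(toy_data["TARGET"].isna().sum())
print(toy_data.shape, n_missing_target)
```

Counting the NaN values in TARGET is therefore a quick way to confirm that the stacked frame contains exactly the test rows we expect.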
application_train.shape
(307511, 68)
application_test.shape
(48744, 67)
# stack train and test rows (original row indices are kept); test rows get NaN in TARGET
data = pd.concat([application_train, application_test])
data.shape
(356255, 68)
data.iloc[356200 : 356210, :]
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_8 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48689 | 455787 | NaN | Cash loans | M | Y | N | 0 | 202500.0 | 312840.0 | 20124.0 | Unaccompanied | Commercial associate | Secondary / secondary special | Married | House / apartment | 0.010276 | -14757 | -8836.0 | -5069 | 8.0 | 0 | 0 | 0 | Laborers | 2.0 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.206775 | 0.413597 | 0.0711 | 0.0445 | 0.9727 | 0.6260 | NaN | 0.0 | 0.1724 | 0.1250 | 0.1667 | 0.0477 | NaN | 0.0839 | NaN | 0.0528 | reg oper account | block of flats | Stone, brick | No | 0.0 | 0.0 | 0.0 | -1762.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 6.0 |
| 48690 | 455798 | NaN | Cash loans | F | N | Y | 1 | 162000.0 | 135000.0 | 7452.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.022625 | -19057 | -54.0 | -2595 | NaN | 1 | 0 | 0 | Laborers | 3.0 | 2 | TUESDAY | 12 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.598238 | 0.517297 | 0.2052 | 0.2113 | 0.9781 | 0.7008 | 0.1003 | 0.0 | 0.4828 | 0.1667 | 0.2083 | 0.2084 | 0.1673 | 0.2043 | 0.0 | 0.0000 | reg oper account | block of flats | Panel | No | 0.0 | 0.0 | 0.0 | -1438.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 48691 | 455802 | NaN | Cash loans | M | Y | N | 2 | 360000.0 | 540000.0 | 42664.5 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.002506 | -11344 | -6102.0 | -3273 | 20.0 | 0 | 0 | 0 | Drivers | 4.0 | 2 | FRIDAY | 9 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.600753 | NaN | 0.0660 | 0.0982 | 0.9796 | NaN | NaN | NaN | 0.1379 | 0.1250 | NaN | 0.0466 | NaN | NaN | NaN | 0.0777 | NaN | block of flats | Block | No | 0.0 | 0.0 | 0.0 | -428.0 | 1 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 48692 | 455803 | NaN | Cash loans | F | N | Y | 1 | 157500.0 | 260640.0 | 29605.5 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.007305 | -11887 | -5121.0 | -4192 | NaN | 0 | 0 | 1 | Cooking staff | 2.0 | 3 | TUESDAY | 10 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.297975 | 0.343131 | 0.457900 | 0.0082 | NaN | 0.9598 | NaN | NaN | 0.0 | 0.0690 | 0.0417 | NaN | NaN | NaN | 0.0103 | NaN | NaN | NaN | block of flats | Wooden | No | 0.0 | 0.0 | 0.0 | -245.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 5.0 |
| 48693 | 455804 | NaN | Cash loans | F | Y | Y | 1 | 315000.0 | 312840.0 | 24844.5 | Unaccompanied | Commercial associate | Secondary / secondary special | Widow | House / apartment | 0.006629 | -17312 | -3598.0 | -859 | 9.0 | 0 | 0 | 0 | NaN | 2.0 | 2 | WEDNESDAY | 7 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 1 | NaN | 0.618161 | 0.547810 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 1.0 | 0.0 | -1345.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 48694 | 455805 | NaN | Cash loans | F | Y | Y | 0 | 337500.0 | 517500.0 | 20907.0 | Unaccompanied | Pensioner | Secondary / secondary special | Married | House / apartment | 0.026392 | -20633 | -6767.0 | -4192 | 4.0 | 0 | 0 | 0 | NaN | 2.0 | 2 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | XNA | NaN | 0.689457 | 0.588488 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -1147.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| 48695 | 455829 | NaN | Cash loans | F | N | Y | 0 | 135000.0 | 450000.0 | 25258.5 | Unaccompanied | Pensioner | Higher education | Married | House / apartment | 0.026392 | -21833 | -9301.0 | -4752 | NaN | 0 | 0 | 0 | NaN | 2.0 | 2 | WEDNESDAY | 16 | 0 | 0 | 0 | 0 | 0 | XNA | 0.882225 | 0.751873 | 0.652897 | 0.0701 | NaN | 0.9722 | NaN | NaN | 0.0 | 0.1379 | 0.1667 | NaN | NaN | NaN | 0.0552 | NaN | NaN | NaN | block of flats | Stone, brick | No | 0.0 | 1.0 | 0.0 | -403.0 | 0 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 48696 | 455840 | NaN | Cash loans | M | Y | Y | 0 | 135000.0 | 257391.0 | 27157.5 | Unaccompanied | Commercial associate | Secondary / secondary special | Single / not married | House / apartment | 0.010276 | -10376 | -4270.0 | -2407 | 11.0 | 0 | 0 | 1 | Laborers | 1.0 | 2 | WEDNESDAY | 11 | 0 | 0 | 1 | 1 | 0 | Self-employed | 0.214167 | 0.591297 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -1757.0 | 0 | 0 | 1 | NaN | NaN | NaN | NaN | NaN | NaN |
| 48697 | 455849 | NaN | Cash loans | F | N | N | 0 | 225000.0 | 360000.0 | 37800.0 | Unaccompanied | Commercial associate | Secondary / secondary special | Married | House / apartment | 0.011657 | -14664 | -2856.0 | -2141 | NaN | 1 | 0 | 0 | NaN | 2.0 | 1 | TUESDAY | 11 | 0 | 0 | 0 | 1 | 1 | Business Entity Type 3 | NaN | 0.610207 | NaN | 0.0113 | 0.0000 | 0.9776 | NaN | NaN | 0.0 | 0.0690 | 0.0417 | NaN | 0.0000 | NaN | 0.0114 | NaN | 0.0000 | NaN | block of flats | Stone, brick | No | 0.0 | 0.0 | 0.0 | -1738.0 | 1 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 48698 | 455850 | NaN | Cash loans | F | N | Y | 0 | 297000.0 | 225000.0 | 13896.0 | Unaccompanied | Pensioner | Secondary / secondary special | Widow | House / apartment | 0.018209 | -24168 | -6959.0 | -4201 | NaN | 0 | 0 | 0 | NaN | 1.0 | 3 | WEDNESDAY | 6 | 0 | 0 | 0 | 0 | 0 | XNA | NaN | 0.726431 | 0.301625 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -1317.0 | 0 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 |
# now just in case, let's check if we've got it right
data.TARGET.isna().sum() # same as number of test rows
48744
sum(data.SK_ID_CURR[data.TARGET.isna()] == application_test.SK_ID_CURR) # all is good
48744
sum(data.SK_ID_CURR.isin(application_test.SK_ID_CURR)) == len(application_test) # nothing else to prove
True
bureau balance -> bureau -> data
Before we merge data with bureau, we need to merge bureau dataframe with related information in bureau_balance file
What is the exact problem here:
The bureau dataframe comes from the Credit Bureau and contains one row for each credit that a client in the train/test dataset has taken previously. It is matched to train/test by SK_ID_CURR. In train/test, SK_ID_CURR values are not duplicated (one row per client we are trying to classify), whereas in most cases the bureau dataframe has multiple rows for the same client, because he/she has applied for multiple loans previously.
In turn, bureau_balance extends the previous credit information even further. It contains a separate row for each month of history of every previous credit reported to the Credit Bureau (the bureau dataframe) and is related to the bureau dataframe via SK_ID_BUREAU.
So the approach we are going to use is to calculate the mean of each numerical column in both of these dataframes and include these mean values as features of the clients we are trying to classify. For example: the mean number of days overdue across all credits that the client has previously taken.
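The averaging idea can be sketched on toy data as follows. The frames `toy_bureau` and `toy_clients`, their values, and the prefix `BUREAU_MEAN_` are assumptions made up for illustration; only the groupby, mean and left-merge pattern is the point.

```python
import pandas as pd

# Hypothetical toy bureau table: one row per previous credit
toy_bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "CREDIT_DAY_OVERDUE": [0, 10, 4],
    "CREDIT_ACTIVE": ["Closed", "Active", "Active"],
})
# Hypothetical toy client table: one row per client, client 3 has no history
toy_clients = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})

# Average the numeric columns per client. SK_ID_CURR is the group index here,
# so add_prefix only renames the feature columns, not the key.
toy_means = (
    toy_bureau.groupby("SK_ID_CURR")
    .mean(numeric_only=True)
    .add_prefix("BUREAU_MEAN_")
    .reset_index()
)

# A left merge keeps clients without bureau history; their features are NaN
toy_merged = toy_clients.merge(toy_means, on="SK_ID_CURR", how="left")
print(toy_merged)
```

Note that the categorical column CREDIT_ACTIVE silently disappears from the result, since a mean is only defined for the numeric columns.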
Note that this approach leaves out some information, such as categorical columns. For example, the client with SK_ID_CURR = 666 had 7 credits in the bureau dataframe. When we collapse all these credits (grouped by one ID) into a single row of mean values, we can no longer represent the CREDIT_ACTIVE column, which takes different categorical values such as Closed or Active for different previous credits. So this leaves room for some interesting feature engineering here.
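One possible way to keep such categorical information, which this notebook does not pursue, is to count how often each category occurs per client. The sketch below uses made-up data and is only an illustration of that idea, not part of our pipeline.

```python
import pandas as pd

# Hypothetical toy bureau table with a categorical status per previous credit
toy_bureau = pd.DataFrame({
    "SK_ID_CURR": [666, 666, 666, 777],
    "CREDIT_ACTIVE": ["Closed", "Active", "Closed", "Active"],
})

# One 0/1 indicator column per category
status_dummies = pd.get_dummies(
    toy_bureau["CREDIT_ACTIVE"], prefix="CREDIT_ACTIVE", dtype=int
)

# Summing the indicators per client gives the number of credits in each status
status_counts = (
    pd.concat([toy_bureau[["SK_ID_CURR"]], status_dummies], axis=1)
    .groupby("SK_ID_CURR", as_index=False)
    .sum()
)
print(status_counts)
```

The resulting frame has one row per client and one count column per category, so it could be merged by SK_ID_CURR in the same way as the mean features.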
Steps that we need to take:
Even though we've decided not to perform any feature engineering, one useful feature here is just asking for it. Let's calculate the total number of previous credits taken by each client and include it in our statistics. I believe this kind of information would be quite useful, so let's quickly do that before carrying out the plan described above.
Add a new column PREVIOUS_LOANS_COUNT for each client (SK_ID_CURR) by grouping the 'bureau' table by SK_ID_CURR.
previous_loan_counts = bureau.groupby('SK_ID_CURR', as_index=False)['SK_ID_BUREAU'].count().rename(columns = {'SK_ID_BUREAU': 'PREVIOUS_LOANS_COUNT'})
previous_loan_counts.head()
| | SK_ID_CURR | PREVIOUS_LOANS_COUNT |
|---|---|---|
| 0 | 100001 | 7 |
| 1 | 100002 | 8 |
| 2 | 100003 | 4 |
| 3 | 100004 | 2 |
| 4 | 100005 | 3 |
data = data.merge(previous_loan_counts, on = 'SK_ID_CURR', how = 'left')
data.head(25)
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_8 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | PREVIOUS_LOANS_COUNT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1.0 | Cash loans | M | N | Y | 0 | 202500.000 | 406597.5 | 24700.5 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -3648.0 | -2120 | NaN | 0 | 1 | 0 | Laborers | 1.0 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | reg oper account | block of flats | Stone, brick | No | 2.0 | 2.0 | 2.0 | -1134.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 |
| 1 | 100003 | 0.0 | Cash loans | F | N | N | 0 | 270000.000 | 1293502.5 | 35698.5 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1186.0 | -291 | NaN | 0 | 1 | 0 | Core staff | 2.0 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | School | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | reg oper account | block of flats | Block | No | 0.0 | 1.0 | 0.0 | -828.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
| 2 | 100004 | 0.0 | Revolving loans | M | Y | Y | 0 | 67500.000 | 135000.0 | 6750.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -4260.0 | -2531 | 26.0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 3 | 100006 | 0.0 | Cash loans | F | N | Y | 0 | 135000.000 | 312682.5 | 29686.5 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -9833.0 | -2437 | NaN | 0 | 0 | 0 | Laborers | 2.0 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 2.0 | 0.0 | -617.0 | 1 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0.0 | Cash loans | M | N | Y | 0 | 121500.000 | 513000.0 | 21865.5 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -4311.0 | -3458 | NaN | 0 | 0 | 0 | Core staff | 1.0 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 5 | 100008 | 0.0 | Cash loans | M | N | Y | 0 | 99000.000 | 490495.5 | 27517.5 | Spouse, partner | State servant | Secondary / secondary special | Married | House / apartment | 0.035792 | -16941 | -4970.0 | -477 | NaN | 1 | 1 | 0 | Laborers | 2.0 | 2 | WEDNESDAY | 16 | 0 | 0 | 0 | 0 | 0 | Other | NaN | 0.354225 | 0.621226 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -2536.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 |
| 6 | 100009 | 0.0 | Cash loans | F | Y | Y | 1 | 171000.000 | 1560726.0 | 41301.0 | Unaccompanied | Commercial associate | Higher education | Married | House / apartment | 0.035792 | -13778 | -1213.0 | -619 | 17.0 | 0 | 1 | 0 | Accountants | 3.0 | 2 | SUNDAY | 16 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.774761 | 0.724000 | 0.492060 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 1.0 | 0.0 | -1562.0 | 0 | 0 | 1 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 18.0 |
| 7 | 100010 | 0.0 | Cash loans | M | Y | Y | 0 | 360000.000 | 1530000.0 | 42075.0 | Unaccompanied | State servant | Higher education | Married | House / apartment | 0.003122 | -18850 | -4597.0 | -2379 | 8.0 | 1 | 0 | 0 | Managers | 2.0 | 3 | MONDAY | 16 | 0 | 0 | 0 | 1 | 1 | Other | NaN | 0.714279 | 0.540654 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 2.0 | 0.0 | -1070.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 8 | 100011 | 0.0 | Cash loans | F | N | Y | 0 | 112500.000 | 1019610.0 | 33826.5 | Children | Pensioner | Secondary / secondary special | Married | House / apartment | 0.018634 | -20099 | -7427.0 | -3514 | NaN | 0 | 0 | 0 | NaN | 2.0 | 2 | WEDNESDAY | 14 | 0 | 0 | 0 | 0 | 0 | XNA | 0.587334 | 0.205747 | 0.751724 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 1.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 9 | 100012 | 0.0 | Revolving loans | M | N | Y | 0 | 135000.000 | 405000.0 | 20250.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.019689 | -14469 | -14437.0 | -3992 | NaN | 0 | 0 | 0 | Laborers | 1.0 | 2 | THURSDAY | 8 | 0 | 0 | 0 | 0 | 0 | Electricity | NaN | 0.746644 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 2.0 | 0.0 | -1673.0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10 | 100014 | 0.0 | Cash loans | F | N | Y | 1 | 112500.000 | 652500.0 | 21177.0 | Unaccompanied | Working | Higher education | Married | House / apartment | 0.022800 | -10197 | -4427.0 | -738 | NaN | 0 | 0 | 0 | Core staff | 3.0 | 2 | SATURDAY | 15 | 0 | 0 | 0 | 0 | 0 | Medicine | 0.319760 | 0.651862 | 0.363945 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -844.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 8.0 |
| 11 | 100015 | 0.0 | Cash loans | F | N | Y | 0 | 38419.155 | 148365.0 | 10678.5 | Children | Pensioner | Secondary / secondary special | Married | House / apartment | 0.015221 | -20417 | -5246.0 | -2512 | NaN | 0 | 1 | 0 | NaN | 2.0 | 2 | FRIDAY | 7 | 0 | 0 | 0 | 0 | 0 | XNA | 0.722044 | 0.555183 | 0.652897 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -2396.0 | 0 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 4.0 |
| 12 | 100016 | 0.0 | Cash loans | F | N | Y | 0 | 67500.000 | 80865.0 | 5881.5 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.031329 | -13439 | -311.0 | -3227 | NaN | 1 | 1 | 0 | Laborers | 2.0 | 2 | FRIDAY | 10 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 2 | 0.464831 | 0.715042 | 0.176653 | 0.0825 | NaN | 0.9811 | NaN | NaN | 0.00 | 0.2069 | 0.1667 | NaN | 0.0135 | NaN | 0.0778 | NaN | 0.0000 | reg oper account | block of flats | NaN | No | 0.0 | 0.0 | 0.0 | -2370.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 7.0 |
| 13 | 100017 | 0.0 | Cash loans | M | Y | N | 1 | 225000.000 | 918468.0 | 28966.5 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.016612 | -14086 | -643.0 | -4911 | 23.0 | 0 | 0 | 0 | Drivers | 3.0 | 2 | THURSDAY | 13 | 0 | 0 | 0 | 0 | 0 | Self-employed | NaN | 0.566907 | 0.770087 | 0.1474 | 0.0973 | 0.9806 | 0.7348 | 0.0582 | 0.16 | 0.1379 | 0.3333 | 0.3750 | 0.0931 | 0.1202 | 0.1397 | 0.0000 | 0.0000 | reg oper account | block of flats | Panel | No | 0.0 | 0.0 | 0.0 | -4.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 6.0 |
| 14 | 100018 | 0.0 | Cash loans | F | N | Y | 0 | 189000.000 | 773680.5 | 32778.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.010006 | -14583 | -615.0 | -2056 | NaN | 0 | 0 | 0 | Laborers | 2.0 | 1 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | Transport: type 2 | 0.721940 | 0.642656 | NaN | 0.3495 | 0.1335 | 0.9985 | 0.9796 | 0.1143 | 0.40 | 0.1724 | 0.6667 | 0.7083 | 0.1758 | 0.2849 | 0.3774 | 0.0193 | 0.1001 | reg oper account | block of flats | Panel | No | 0.0 | 0.0 | 0.0 | -188.0 | 1 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 15 | 100019 | 0.0 | Cash loans | M | Y | Y | 0 | 157500.000 | 299772.0 | 20160.0 | Family | Working | Secondary / secondary special | Single / not married | Rented apartment | 0.020713 | -8728 | -3494.0 | -1368 | 17.0 | 0 | 0 | 0 | Laborers | 1.0 | 3 | SATURDAY | 6 | 0 | 0 | 1 | 1 | 0 | Business Entity Type 2 | 0.115634 | 0.346634 | 0.678568 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -925.0 | 0 | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| 16 | 100020 | 0.0 | Cash loans | M | N | N | 0 | 108000.000 | 509602.5 | 26149.5 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.018634 | -12931 | -6392.0 | -3866 | NaN | 0 | 0 | 0 | Drivers | 2.0 | 2 | THURSDAY | 12 | 0 | 0 | 1 | 1 | 0 | Government | NaN | 0.236378 | 0.062103 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -3.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 4.0 |
| 17 | 100021 | 0.0 | Revolving loans | F | N | Y | 1 | 81000.000 | 270000.0 | 13500.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.010966 | -9776 | -4143.0 | -2427 | NaN | 0 | 0 | 0 | Laborers | 3.0 | 2 | MONDAY | 10 | 0 | 0 | 1 | 1 | 0 | Construction | NaN | 0.683513 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 4.0 | 0.0 | -2811.0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 18 | 100022 | 0.0 | Revolving loans | F | N | Y | 0 | 112500.000 | 157500.0 | 7875.0 | Other_A | Working | Secondary / secondary special | Widow | House / apartment | 0.046220 | -17718 | -8751.0 | -1259 | NaN | 0 | 1 | 0 | Laborers | 1.0 | 1 | FRIDAY | 13 | 0 | 0 | 0 | 0 | 0 | Housing | NaN | 0.706428 | 0.556727 | 0.0278 | 0.0617 | 0.9881 | 0.8368 | 0.0018 | 0.00 | 0.1034 | 0.0833 | 0.1250 | 0.0279 | 0.0227 | 0.0290 | 0.0000 | 0.0000 | reg oper account | block of flats | Stone, brick | No | 0.0 | 8.0 | 0.0 | -239.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 19 | 100023 | 0.0 | Cash loans | F | N | Y | 1 | 90000.000 | 544491.0 | 17563.5 | Unaccompanied | State servant | Higher education | Single / not married | House / apartment | 0.015221 | -11348 | -1021.0 | -3964 | NaN | 1 | 1 | 0 | Core staff | 2.0 | 2 | MONDAY | 12 | 0 | 0 | 0 | 0 | 0 | Kindergarten | NaN | 0.586617 | 0.477649 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -1850.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 13.0 |
| 20 | 100024 | 0.0 | Revolving loans | M | Y | Y | 0 | 135000.000 | 427500.0 | 21375.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.015221 | -18252 | -298.0 | -1800 | 7.0 | 0 | 0 | 0 | Laborers | 2.0 | 2 | FRIDAY | 13 | 0 | 0 | 0 | 0 | 0 | Self-employed | 0.565655 | 0.113375 | NaN | 0.0722 | 0.0801 | 0.9781 | 0.7008 | NaN | 0.00 | 0.1379 | 0.1667 | 0.0417 | 0.0534 | 0.0588 | 0.0619 | 0.0000 | 0.0000 | reg oper account | block of flats | Stone, brick | No | 0.0 | 0.0 | 0.0 | -296.0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 21 | 100025 | 0.0 | Cash loans | F | Y | Y | 1 | 202500.000 | 1132573.5 | 37561.5 | Unaccompanied | Commercial associate | Secondary / secondary special | Married | House / apartment | 0.025164 | -14815 | -2299.0 | -2299 | 14.0 | 0 | 0 | 0 | Sales staff | 3.0 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | Trade: type 7 | 0.437709 | 0.233767 | 0.542445 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 1.0 |
| 22 | 100026 | 0.0 | Cash loans | F | N | N | 1 | 450000.000 | 497520.0 | 32521.5 | Unaccompanied | Working | Secondary / secondary special | Married | Rented apartment | 0.020713 | -11146 | -114.0 | -2518 | NaN | 0 | 0 | 0 | Sales staff | 3.0 | 2 | THURSDAY | 6 | 0 | 0 | 0 | 0 | 0 | Self-employed | NaN | 0.457143 | 0.358951 | 0.0907 | 0.0795 | 0.9786 | 0.7076 | 0.0120 | 0.00 | 0.2069 | 0.1667 | 0.2083 | 0.0898 | 0.0723 | 0.0873 | 0.0077 | 0.0044 | reg oper account | block of flats | Panel | No | 0.0 | 0.0 | 0.0 | -468.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 3.0 |
| 23 | 100027 | 0.0 | Cash loans | F | N | Y | 0 | 83250.000 | 239850.0 | 23850.0 | Unaccompanied | Pensioner | Secondary / secondary special | Married | House / apartment | 0.006296 | -24827 | -9012.0 | -3684 | NaN | 0 | 1 | 0 | NaN | 2.0 | 3 | FRIDAY | 12 | 0 | 0 | 0 | 0 | 0 | XNA | NaN | 0.624305 | 0.669057 | 0.1443 | 0.0848 | 0.9876 | 0.8300 | 0.1064 | 0.14 | 0.1207 | 0.3750 | 0.4167 | 0.2371 | 0.1173 | 0.1484 | 0.0019 | 0.0007 | org spec account | block of flats | Mixed | No | 0.0 | 0.0 | 0.0 | -795.0 | 0 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 |
| 24 | 100029 | 0.0 | Cash loans | M | Y | N | 2 | 135000.000 | 247500.0 | 12703.5 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.026392 | -11286 | -108.0 | -3729 | 7.0 | 0 | 0 | 0 | Drivers | 4.0 | 2 | THURSDAY | 14 | 0 | 0 | 0 | 1 | 1 | Business Entity Type 3 | NaN | 0.786179 | 0.565608 | 0.1433 | 0.1455 | 0.9861 | 0.8096 | 0.0212 | 0.00 | 0.3103 | 0.1667 | 0.2083 | 0.0861 | 0.1168 | 0.1217 | 0.0000 | 0.0043 | reg oper account | block of flats | Panel | No | 1.0 | 1.0 | 0.0 | -4.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
Now back to merging with all the bureau and bureau_balance information
bureau_balance.head()
| | SK_ID_BUREAU | MONTHS_BALANCE | STATUS |
|---|---|---|---|
| 0 | 5715448 | 0 | C |
| 1 | 5715448 | -1 | C |
| 2 | 5715448 | -2 | C |
| 3 | 5715448 | -3 | C |
| 4 | 5715448 | -4 | C |
# first define the function for grouping rows by ID and calculating mean values
def extract_mean(x):
    # numeric_only=True skips non-numeric columns such as STATUS
    y = x.groupby('SK_ID_BUREAU', as_index=False).mean(numeric_only=True).add_prefix('BUR_BAL_MEAN_')
    return y
# apply the function to create a bureau_balance dataframe grouped by SK_ID_BUREAU with mean values of all numerical columns
bureau_bal_mean = extract_mean(bureau_balance)
bureau_bal_mean.head()
| | BUR_BAL_MEAN_SK_ID_BUREAU | BUR_BAL_MEAN_MONTHS_BALANCE |
|---|---|---|
| 0 | 5001709 | -48.0 |
| 1 | 5001710 | -41.0 |
| 2 | 5001711 | -1.5 |
| 3 | 5001712 | -9.0 |
| 4 | 5001713 | -10.5 |
As you can see, this dataframe does not include the bureau_balance categorical column STATUS.
Also note that our function has changed the name of SK_ID_BUREAU; we need to change it back in order to use it when merging with the bureau dataframe.
One might argue that we didn't need to add .add_prefix(...) to our function above, but it will prove useful when working with the larger datasets below.
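A small sketch of why the prefix helps: when two tables share a column name other than the merge key, pandas disambiguates them with `_x` / `_y` suffixes, which are hard to trace back to their source. The toy frames and the prefix `BUR_MEAN_` below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frames that both contain an AMT_ANNUITY column
toy_data = pd.DataFrame({"SK_ID_CURR": [1, 2], "AMT_ANNUITY": [100.0, 200.0]})
toy_bureau_mean = pd.DataFrame({"SK_ID_CURR": [1, 2], "AMT_ANNUITY": [10.0, 20.0]})

# Without a prefix, the clashing columns become AMT_ANNUITY_x / AMT_ANNUITY_y
clashing = toy_data.merge(toy_bureau_mean, on="SK_ID_CURR", how="left")

# Prefix every column except the merge key, so the origin stays visible
key = "SK_ID_CURR"
prefixed = toy_bureau_mean.rename(
    columns=lambda c: c if c == key else "BUR_MEAN_" + c
)
clean = toy_data.merge(prefixed, on=key, how="left")

print(list(clashing.columns))
print(list(clean.columns))
```

Renaming with a function that skips the key also avoids having to rename the key column back afterwards.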
bureau_bal_mean = bureau_bal_mean.rename(columns = {'BUR_BAL_MEAN_SK_ID_BUREAU' : 'SK_ID_BUREAU'})
bureau_bal_mean.head()
| | SK_ID_BUREAU | BUR_BAL_MEAN_MONTHS_BALANCE |
|---|---|---|
| 0 | 5001709 | -48.0 |
| 1 | 5001710 | -41.0 |
| 2 | 5001711 | -1.5 |
| 3 | 5001712 | -9.0 |
| 4 | 5001713 | -10.5 |
bureau.head()
| | SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
| 1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
| 2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
| 3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
| 4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
bureau_bal_mean.head()
| | SK_ID_BUREAU | BUR_BAL_MEAN_MONTHS_BALANCE |
|---|---|---|
| 0 | 5001709 | -48.0 |
| 1 | 5001710 | -41.0 |
| 2 | 5001711 | -1.5 |
| 3 | 5001712 | -9.0 |
| 4 | 5001713 | -10.5 |
bureau = bureau.merge(bureau_bal_mean, on = 'SK_ID_BUREAU', how = 'left')
bureau.drop('SK_ID_BUREAU', axis = 1, inplace = True) # we don't need this internal ID anymore
# check whether bureau has really been merged with bureau_bal_mean
bureau.head()
| | SK_ID_CURR | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | BUR_BAL_MEAN_MONTHS_BALANCE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN | NaN |
| 1 | 215354 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN | NaN |
| 2 | 215354 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN | NaN |
| 3 | 215354 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -16 | NaN | NaN |
| 4 | 215354 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN | NaN |
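The NaN values in the new column for these first rows do not mean the merge failed; with a left merge they only mean that those loans have no rows in bureau_balance. One way to check this without eyeballing `head()` is to count how many rows found a match. The sketch below uses made-up toy frames (`left` and `right` are hypothetical stand-ins for `bureau` and `bureau_bal_mean`) together with pandas' `indicator=True` option.

```python
import pandas as pd

# Toy stand-ins (made-up values): loan 10 has no balance history
left = pd.DataFrame({"SK_ID_BUREAU": [10, 11, 12],
                     "DAYS_CREDIT": [-497, -208, -203]})
right = pd.DataFrame({"SK_ID_BUREAU": [11, 12],
                      "BUR_BAL_MEAN_MONTHS_BALANCE": [-1.5, -9.0]})

# indicator=True adds a _merge column: 'both' for matched rows, 'left_only' otherwise
merged = left.merge(right, on="SK_ID_BUREAU", how="left", indicator=True)

match_counts = merged["_merge"].value_counts()
matched_share = (merged["_merge"] == "both").mean()

print(match_counts)
print(f"share of loans with balance history: {matched_share:.2f}")
```

On the real data the same idea applies before SK_ID_BUREAU is dropped; the `_merge` column can be removed again once the counts have been inspected.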
bureau.shape
(1716428, 17)
bureau["SK_ID_CURR"].nunique()
305811
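Comparing the two numbers above, 1,716,428 rows against 305,811 distinct clients means about 5.6 bureau rows per client on average. The same check can be sketched on toy data (the frame and its values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for bureau: several loans can belong to the same client
toy_bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 2, 3],
    "AMT_CREDIT_SUM": [100.0, 250.0, 80.0, 120.0, 90.0],
})

n_rows = len(toy_bureau)                             # number of loans
n_clients = toy_bureau["SK_ID_CURR"].nunique()       # number of distinct clients
key_is_unique = toy_bureau["SK_ID_CURR"].is_unique   # False when a client repeats
loans_per_client = toy_bureau.groupby("SK_ID_CURR").size()

print(n_rows, n_clients, key_is_unique)
print(loans_per_client)
```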
# This shows that SK_ID_CURR is not unique yet, so we now group by it to get one row per client
def extract_mean(x):
    # note that we have changed the ID to group by and the prefix to add
    y = x.groupby('SK_ID_CURR', as_index=False).mean(numeric_only=True).add_prefix('PREV_BUR_MEAN_')
    return y
bureau_mean_values = extract_mean(bureau)
bureau_mean_values = bureau_mean_values.rename(columns = {'PREV_BUR_MEAN_SK_ID_CURR' : 'SK_ID_CURR'})
bureau_mean_values.head(10)
| | SK_ID_CURR | PREV_BUR_MEAN_DAYS_CREDIT | PREV_BUR_MEAN_CREDIT_DAY_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE | PREV_BUR_MEAN_DAYS_ENDDATE_FACT | PREV_BUR_MEAN_AMT_CREDIT_MAX_OVERDUE | PREV_BUR_MEAN_CNT_CREDIT_PROLONG | PREV_BUR_MEAN_AMT_CREDIT_SUM | PREV_BUR_MEAN_AMT_CREDIT_SUM_DEBT | PREV_BUR_MEAN_AMT_CREDIT_SUM_LIMIT | PREV_BUR_MEAN_AMT_CREDIT_SUM_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_UPDATE | PREV_BUR_MEAN_AMT_ANNUITY | PREV_BUR_MEAN_BUR_BAL_MEAN_MONTHS_BALANCE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | -735.000000 | 0.0 | 82.428571 | -825.500000 | NaN | 0.0 | 207623.571429 | 85240.928571 | 0.00000 | 0.0 | -93.142857 | 3545.357143 | -11.785714 |
| 1 | 100002 | -874.000000 | 0.0 | -349.000000 | -697.500000 | 1681.029 | 0.0 | 108131.945625 | 49156.200000 | 7997.14125 | 0.0 | -499.875000 | 0.000000 | -21.875000 |
| 2 | 100003 | -1400.750000 | 0.0 | -544.500000 | -1097.333333 | 0.000 | 0.0 | 254350.125000 | 0.000000 | 202500.00000 | 0.0 | -816.000000 | NaN | NaN |
| 3 | 100004 | -867.000000 | 0.0 | -488.500000 | -532.500000 | 0.000 | 0.0 | 94518.900000 | 0.000000 | 0.00000 | 0.0 | -532.000000 | NaN | NaN |
| 4 | 100005 | -190.666667 | 0.0 | 439.333333 | -123.000000 | 0.000 | 0.0 | 219042.000000 | 189469.500000 | 0.00000 | 0.0 | -54.333333 | 1420.500000 | -3.000000 |
| 5 | 100007 | -1149.000000 | 0.0 | -783.000000 | -783.000000 | 0.000 | 0.0 | 146250.000000 | 0.000000 | 0.00000 | 0.0 | -783.000000 | NaN | NaN |
| 6 | 100008 | -757.333333 | 0.0 | -391.333333 | -909.000000 | 0.000 | 0.0 | 156148.500000 | 80019.000000 | 0.00000 | 0.0 | -611.000000 | NaN | NaN |
| 7 | 100009 | -1271.500000 | 0.0 | -794.937500 | -1108.500000 | 0.000 | 0.0 | 266711.750000 | 76953.535714 | 0.00000 | 0.0 | -851.611111 | NaN | NaN |
| 8 | 100010 | -1939.500000 | 0.0 | -119.500000 | -1138.000000 | NaN | 0.0 | 495000.000000 | 174003.750000 | 0.00000 | 0.0 | -578.000000 | NaN | -46.000000 |
| 9 | 100011 | -1773.000000 | 0.0 | -1293.250000 | -1463.250000 | 5073.615 | 0.0 | 108807.075000 | 0.000000 | 0.00000 | 0.0 | -1454.750000 | NaN | NaN |
bureau_mean_values.shape
(305811, 14)
bureau_mean_values["SK_ID_CURR"].nunique()
305811
Looks good. There are a few missing values, though, which we will deal with later.
data.shape
(356255, 69)
data = data.merge(bureau_mean_values, on = 'SK_ID_CURR', how = 'left')
data.shape
(356255, 82)
So here we have created 13 new features and added them to our train/test dataset called 'data'.
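The count follows from the shapes: bureau_mean_values has 14 columns, one of which is the shared key SK_ID_CURR, so 14 - 1 = 13 columns are new and 69 + 13 = 82. The same bookkeeping can be automated. A minimal sketch on made-up toy frames (`base` and `extra` are hypothetical stand-ins for `data` and `bureau_mean_values`):

```python
import pandas as pd

base = pd.DataFrame({"SK_ID_CURR": [1, 2, 3],
                     "AMT_CREDIT": [100.0, 200.0, 300.0]})
extra = pd.DataFrame({"SK_ID_CURR": [1, 3],
                      "PREV_BUR_MEAN_DAYS_CREDIT": [-874.0, -867.0]})

merged = base.merge(extra, on="SK_ID_CURR", how="left")

# Columns that exist after the merge but not before it
new_columns = [c for c in merged.columns if c not in base.columns]

print(new_columns)
print(len(base), len(merged))  # a left merge on a unique key keeps the row count
```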
data.head(20)
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_8 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | PREVIOUS_LOANS_COUNT | PREV_BUR_MEAN_DAYS_CREDIT | PREV_BUR_MEAN_CREDIT_DAY_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE | PREV_BUR_MEAN_DAYS_ENDDATE_FACT | PREV_BUR_MEAN_AMT_CREDIT_MAX_OVERDUE | PREV_BUR_MEAN_CNT_CREDIT_PROLONG | PREV_BUR_MEAN_AMT_CREDIT_SUM | PREV_BUR_MEAN_AMT_CREDIT_SUM_DEBT | PREV_BUR_MEAN_AMT_CREDIT_SUM_LIMIT | PREV_BUR_MEAN_AMT_CREDIT_SUM_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_UPDATE | PREV_BUR_MEAN_AMT_ANNUITY | PREV_BUR_MEAN_BUR_BAL_MEAN_MONTHS_BALANCE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1.0 | Cash loans | M | N | Y | 0 | 202500.000 | 406597.5 | 24700.5 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -3648.0 | -2120 | NaN | 0 | 1 | 0 | Laborers | 1.0 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | reg oper account | block of flats | Stone, brick | No | 2.0 | 2.0 | 2.0 | -1134.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 | -874.000000 | 0.0 | -349.000000 | -697.500000 | 1681.0290 | 0.0 | 108131.945625 | 49156.200000 | 7997.14125 | 0.0 | -499.875000 | 0.0 | -21.875 |
| 1 | 100003 | 0.0 | Cash loans | F | N | N | 0 | 270000.000 | 1293502.5 | 35698.5 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1186.0 | -291 | NaN | 0 | 1 | 0 | Core staff | 2.0 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | School | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | reg oper account | block of flats | Block | No | 0.0 | 1.0 | 0.0 | -828.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | -1400.750000 | 0.0 | -544.500000 | -1097.333333 | 0.0000 | 0.0 | 254350.125000 | 0.000000 | 202500.00000 | 0.0 | -816.000000 | NaN | NaN |
| 2 | 100004 | 0.0 | Revolving loans | M | Y | Y | 0 | 67500.000 | 135000.0 | 6750.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -4260.0 | -2531 | 26.0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | -867.000000 | 0.0 | -488.500000 | -532.500000 | 0.0000 | 0.0 | 94518.900000 | 0.000000 | 0.00000 | 0.0 | -532.000000 | NaN | NaN |
| 3 | 100006 | 0.0 | Cash loans | F | N | Y | 0 | 135000.000 | 312682.5 | 29686.5 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -9833.0 | -2437 | NaN | 0 | 0 | 0 | Laborers | 2.0 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 2.0 | 0.0 | -617.0 | 1 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0.0 | Cash loans | M | N | Y | 0 | 121500.000 | 513000.0 | 21865.5 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -4311.0 | -3458 | NaN | 0 | 0 | 0 | Core staff | 1.0 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1149.000000 | 0.0 | -783.000000 | -783.000000 | 0.0000 | 0.0 | 146250.000000 | 0.000000 | 0.00000 | 0.0 | -783.000000 | NaN | NaN |
| 5 | 100008 | 0.0 | Cash loans | M | N | Y | 0 | 99000.000 | 490495.5 | 27517.5 | Spouse, partner | State servant | Secondary / secondary special | Married | House / apartment | 0.035792 | -16941 | -4970.0 | -477 | NaN | 1 | 1 | 0 | Laborers | 2.0 | 2 | WEDNESDAY | 16 | 0 | 0 | 0 | 0 | 0 | Other | NaN | 0.354225 | 0.621226 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -2536.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 | -757.333333 | 0.0 | -391.333333 | -909.000000 | 0.0000 | 0.0 | 156148.500000 | 80019.000000 | 0.00000 | 0.0 | -611.000000 | NaN | NaN |
| 6 | 100009 | 0.0 | Cash loans | F | Y | Y | 1 | 171000.000 | 1560726.0 | 41301.0 | Unaccompanied | Commercial associate | Higher education | Married | House / apartment | 0.035792 | -13778 | -1213.0 | -619 | 17.0 | 0 | 1 | 0 | Accountants | 3.0 | 2 | SUNDAY | 16 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.774761 | 0.724000 | 0.492060 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 1.0 | 0.0 | -1562.0 | 0 | 0 | 1 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 18.0 | -1271.500000 | 0.0 | -794.937500 | -1108.500000 | 0.0000 | 0.0 | 266711.750000 | 76953.535714 | 0.00000 | 0.0 | -851.611111 | NaN | NaN |
| 7 | 100010 | 0.0 | Cash loans | M | Y | Y | 0 | 360000.000 | 1530000.0 | 42075.0 | Unaccompanied | State servant | Higher education | Married | House / apartment | 0.003122 | -18850 | -4597.0 | -2379 | 8.0 | 1 | 0 | 0 | Managers | 2.0 | 3 | MONDAY | 16 | 0 | 0 | 0 | 1 | 1 | Other | NaN | 0.714279 | 0.540654 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 2.0 | 0.0 | -1070.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | -1939.500000 | 0.0 | -119.500000 | -1138.000000 | NaN | 0.0 | 495000.000000 | 174003.750000 | 0.00000 | 0.0 | -578.000000 | NaN | -46.000 |
| 8 | 100011 | 0.0 | Cash loans | F | N | Y | 0 | 112500.000 | 1019610.0 | 33826.5 | Children | Pensioner | Secondary / secondary special | Married | House / apartment | 0.018634 | -20099 | -7427.0 | -3514 | NaN | 0 | 0 | 0 | NaN | 2.0 | 2 | WEDNESDAY | 14 | 0 | 0 | 0 | 0 | 0 | XNA | 0.587334 | 0.205747 | 0.751724 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 1.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | -1773.000000 | 0.0 | -1293.250000 | -1463.250000 | 5073.6150 | 0.0 | 108807.075000 | 0.000000 | 0.00000 | 0.0 | -1454.750000 | NaN | NaN |
| 9 | 100012 | 0.0 | Revolving loans | M | N | Y | 0 | 135000.000 | 405000.0 | 20250.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.019689 | -14469 | -14437.0 | -3992 | NaN | 0 | 0 | 0 | Laborers | 1.0 | 2 | THURSDAY | 8 | 0 | 0 | 0 | 0 | 0 | Electricity | NaN | 0.746644 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 2.0 | 0.0 | -1673.0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10 | 100014 | 0.0 | Cash loans | F | N | Y | 1 | 112500.000 | 652500.0 | 21177.0 | Unaccompanied | Working | Higher education | Married | House / apartment | 0.022800 | -10197 | -4427.0 | -738 | NaN | 0 | 0 | 0 | Core staff | 3.0 | 2 | SATURDAY | 15 | 0 | 0 | 0 | 0 | 0 | Medicine | 0.319760 | 0.651862 | 0.363945 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -844.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 8.0 | -1095.375000 | 0.0 | -387.375000 | -821.333333 | 3726.3525 | 0.0 | 341241.553125 | 151642.800000 | 0.00000 | 0.0 | -615.875000 | NaN | NaN |
| 11 | 100015 | 0.0 | Cash loans | F | N | Y | 0 | 38419.155 | 148365.0 | 10678.5 | Children | Pensioner | Secondary / secondary special | Married | House / apartment | 0.015221 | -20417 | -5246.0 | -2512 | NaN | 0 | 1 | 0 | NaN | 2.0 | 2 | FRIDAY | 7 | 0 | 0 | 0 | 0 | 0 | XNA | 0.722044 | 0.555183 | 0.652897 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -2396.0 | 0 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 4.0 | -947.750000 | 0.0 | -598.250000 | -555.500000 | NaN | 0.0 | 102373.875000 | 0.000000 | 0.00000 | 0.0 | -551.250000 | NaN | NaN |
| 12 | 100016 | 0.0 | Cash loans | F | N | Y | 0 | 67500.000 | 80865.0 | 5881.5 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.031329 | -13439 | -311.0 | -3227 | NaN | 1 | 1 | 0 | Laborers | 2.0 | 2 | FRIDAY | 10 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 2 | 0.464831 | 0.715042 | 0.176653 | 0.0825 | NaN | 0.9811 | NaN | NaN | 0.00 | 0.2069 | 0.1667 | NaN | 0.0135 | NaN | 0.0778 | NaN | 0.0000 | reg oper account | block of flats | NaN | No | 0.0 | 0.0 | 0.0 | -2370.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 7.0 | -618.428571 | 0.0 | -217.142857 | -929.666667 | 0.0000 | 0.0 | 67854.857143 | 12744.900000 | 0.00000 | 0.0 | -405.857143 | NaN | NaN |
| 13 | 100017 | 0.0 | Cash loans | M | Y | N | 1 | 225000.000 | 918468.0 | 28966.5 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.016612 | -14086 | -643.0 | -4911 | 23.0 | 0 | 0 | 0 | Drivers | 3.0 | 2 | THURSDAY | 13 | 0 | 0 | 0 | 0 | 0 | Self-employed | NaN | 0.566907 | 0.770087 | 0.1474 | 0.0973 | 0.9806 | 0.7348 | 0.0582 | 0.16 | 0.1379 | 0.3333 | 0.3750 | 0.0931 | 0.1202 | 0.1397 | 0.0000 | 0.0000 | reg oper account | block of flats | Panel | No | 0.0 | 0.0 | 0.0 | -4.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 6.0 | -1944.333333 | 0.0 | -1512.333333 | -1677.833333 | 0.0000 | 0.0 | 143295.000000 | 0.000000 | 0.00000 | 0.0 | -1594.333333 | NaN | NaN |
| 14 | 100018 | 0.0 | Cash loans | F | N | Y | 0 | 189000.000 | 773680.5 | 32778.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.010006 | -14583 | -615.0 | -2056 | NaN | 0 | 0 | 0 | Laborers | 2.0 | 1 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | Transport: type 2 | 0.721940 | 0.642656 | NaN | 0.3495 | 0.1335 | 0.9985 | 0.9796 | 0.1143 | 0.40 | 0.1724 | 0.6667 | 0.7083 | 0.1758 | 0.2849 | 0.3774 | 0.0193 | 0.1001 | reg oper account | block of flats | Panel | No | 0.0 | 0.0 | 0.0 | -188.0 | 1 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 15 | 100019 | 0.0 | Cash loans | M | Y | Y | 0 | 157500.000 | 299772.0 | 20160.0 | Family | Working | Secondary / secondary special | Single / not married | Rented apartment | 0.020713 | -8728 | -3494.0 | -1368 | 17.0 | 0 | 0 | 0 | Laborers | 1.0 | 3 | SATURDAY | 6 | 0 | 0 | 1 | 1 | 0 | Business Entity Type 2 | 0.115634 | 0.346634 | 0.678568 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -925.0 | 0 | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | -495.000000 | 0.0 | 5441.000000 | NaN | 0.0000 | 0.0 | 360000.000000 | 122735.070000 | 135000.00000 | 0.0 | -26.500000 | 27000.0 | -8.000 |
| 16 | 100020 | 0.0 | Cash loans | M | N | N | 0 | 108000.000 | 509602.5 | 26149.5 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.018634 | -12931 | -6392.0 | -3866 | NaN | 0 | 0 | 0 | Drivers | 2.0 | 2 | THURSDAY | 12 | 0 | 0 | 1 | 1 | 0 | Government | NaN | 0.236378 | 0.062103 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -3.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 4.0 | -261.500000 | 0.0 | -32.500000 | -223.000000 | 0.0000 | 0.0 | 49871.250000 | 30772.125000 | 0.00000 | 0.0 | -130.750000 | NaN | NaN |
| 17 | 100021 | 0.0 | Revolving loans | F | N | Y | 1 | 81000.000 | 270000.0 | 13500.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.010966 | -9776 | -4143.0 | -2427 | NaN | 0 | 0 | 0 | Laborers | 3.0 | 2 | MONDAY | 10 | 0 | 0 | 1 | 1 | 0 | Construction | NaN | 0.683513 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 4.0 | 0.0 | -2811.0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 18 | 100022 | 0.0 | Revolving loans | F | N | Y | 0 | 112500.000 | 157500.0 | 7875.0 | Other_A | Working | Secondary / secondary special | Widow | House / apartment | 0.046220 | -17718 | -8751.0 | -1259 | NaN | 0 | 1 | 0 | Laborers | 1.0 | 1 | FRIDAY | 13 | 0 | 0 | 0 | 0 | 0 | Housing | NaN | 0.706428 | 0.556727 | 0.0278 | 0.0617 | 0.9881 | 0.8368 | 0.0018 | 0.00 | 0.1034 | 0.0833 | 0.1250 | 0.0279 | 0.0227 | 0.0290 | 0.0000 | 0.0000 | reg oper account | block of flats | Stone, brick | No | 0.0 | 8.0 | 0.0 | -239.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | -337.000000 | 0.0 | 940.000000 | NaN | 0.0000 | 0.0 | 528750.000000 | 205276.500000 | 0.00000 | 0.0 | -28.000000 | NaN | NaN |
| 19 | 100023 | 0.0 | Cash loans | F | N | Y | 1 | 90000.000 | 544491.0 | 17563.5 | Unaccompanied | State servant | Higher education | Single / not married | House / apartment | 0.015221 | -11348 | -1021.0 | -3964 | NaN | 1 | 1 | 0 | Core staff | 2.0 | 2 | MONDAY | 12 | 0 | 0 | 0 | 0 | 0 | Kindergarten | NaN | 0.586617 | 0.477649 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -1850.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 13.0 | -1164.384615 | 0.0 | -364.916667 | -997.300000 | 2720.8500 | 0.0 | 126591.718846 | 13703.850000 | 0.00000 | 0.0 | -634.076923 | NaN | NaN |
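The next step follows this order: installments_payments, credit_card_balance, POS_CASH_balance => previous_application
Quick information on this block of data: perhaps unsurprisingly, previous_application reflects clients' previous applications for loans to Home Credit. As before, previous_application comes with a load of statistics spread over three other dataframes: credit_card_balance, POS_CASH_balance and installments_payments.
So the plan here is the following: aggregate each of these three dataframes by SK_ID_PREV, merge the result into previous_application, and then map previous_application to our data through SK_ID_CURR.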
But before we start, let's check whether there are any records in previous_application that are not in our data.
# True only if every SK_ID_CURR in previous_application also appears in data
previous_application.SK_ID_CURR.isin(data.SK_ID_CURR).all()
True
Looks good.
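When checking membership like this, it helps to remember that `isin` returns a boolean mask with one entry per row, so the length of the mask always equals the length of the input and says nothing about membership. The informative quantities are `.all()` on the mask or the list of unmatched IDs. A toy sketch with made-up IDs (`prev_ids` and `data_ids` are hypothetical names):

```python
import pandas as pd

prev_ids = pd.Series([1, 1, 2, 5], name="SK_ID_CURR")   # ID 5 is unknown to data_ids
data_ids = pd.Series([1, 2, 3], name="SK_ID_CURR")

mask = prev_ids.isin(data_ids)

same_length = len(mask) == len(prev_ids)      # always True, whatever the IDs are
all_present = bool(mask.all())                # True only if every ID is found
missing_ids = prev_ids[~mask].unique().tolist()

print(same_length, all_present, missing_ids)
```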
One more thing! We will drop SK_ID_CURR from credit_card_balance, POS_CASH_balance and installments_payments, because we do not want this ID to be averaged like a feature: it carries no statistical information and would only add noise. We will merge these three dataframes into our 'leading' dataset previous_application on SK_ID_PREV, and previous_application keeps the SK_ID_CURR key that will later map it to our data.
credit_card_balance.drop('SK_ID_CURR', axis = 1, inplace = True)
installments_payments.drop('SK_ID_CURR', axis = 1, inplace = True)
POS_CASH_balance.drop('SK_ID_CURR', axis = 1, inplace = True)
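The two-level aggregation described above can be sketched end to end on toy data. Everything in the block is an assumption for illustration: the tiny frames, their values and the `PREV_APP_MEAN_` prefix are made up, and only one child table is shown.

```python
import pandas as pd

# Toy tables; column names mirror the notebook, values are made up
card = pd.DataFrame({"SK_ID_PREV": [10, 10, 11], "SK_DPD": [0, 2, 4]})
prev = pd.DataFrame({"SK_ID_PREV": [10, 11, 12], "SK_ID_CURR": [1, 1, 2]})
apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})

# Level 1: monthly rows -> one row per previous application
card_mean = card.groupby("SK_ID_PREV").mean().add_prefix("CARD_MEAN_").reset_index()
prev = prev.merge(card_mean, on="SK_ID_PREV", how="left")

# Level 2: previous applications -> one row per client
prev_mean = (
    prev.drop(columns="SK_ID_PREV")      # the internal ID is not a feature
    .groupby("SK_ID_CURR")
    .mean()
    .add_prefix("PREV_APP_MEAN_")        # hypothetical prefix
    .reset_index()
)
apps = apps.merge(prev_mean, on="SK_ID_CURR", how="left")

print(apps)
```

Note that the final value is a mean of per-application means, so an application with many monthly rows weighs the same as one with few.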
As before, let's first extract the number of previous applications each client has made to Home Credit and add this feature to our data, before breaking down the previous_application statistics.
previous_application_counts = previous_application.groupby('SK_ID_CURR', as_index=False)['SK_ID_PREV'].count().rename(columns = {'SK_ID_PREV': 'PREVIOUS_APPLICATION_COUNT'})
previous_application_counts.head()
| | SK_ID_CURR | PREVIOUS_APPLICATION_COUNT |
|---|---|---|
| 0 | 100001 | 1 |
| 1 | 100002 | 1 |
| 2 | 100003 | 3 |
| 3 | 100004 | 1 |
| 4 | 100005 | 2 |
# and throw that column in our data
data = data.merge(previous_application_counts, on = 'SK_ID_CURR', how = 'left')
data.head(5)
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_8 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | PREVIOUS_LOANS_COUNT | PREV_BUR_MEAN_DAYS_CREDIT | PREV_BUR_MEAN_CREDIT_DAY_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE | PREV_BUR_MEAN_DAYS_ENDDATE_FACT | PREV_BUR_MEAN_AMT_CREDIT_MAX_OVERDUE | PREV_BUR_MEAN_CNT_CREDIT_PROLONG | PREV_BUR_MEAN_AMT_CREDIT_SUM | PREV_BUR_MEAN_AMT_CREDIT_SUM_DEBT | PREV_BUR_MEAN_AMT_CREDIT_SUM_LIMIT | PREV_BUR_MEAN_AMT_CREDIT_SUM_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_UPDATE | PREV_BUR_MEAN_AMT_ANNUITY | PREV_BUR_MEAN_BUR_BAL_MEAN_MONTHS_BALANCE | PREVIOUS_APPLICATION_COUNT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1.0 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -3648.0 | -2120 | NaN | 0 | 1 | 0 | Laborers | 1.0 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | reg oper account | block of flats | Stone, brick | No | 2.0 | 2.0 | 2.0 | -1134.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 | -874.00 | 0.0 | -349.0 | -697.500000 | 1681.029 | 0.0 | 108131.945625 | 49156.2 | 7997.14125 | 0.0 | -499.875 | 0.0 | -21.875 | 1.0 |
| 1 | 100003 | 0.0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1186.0 | -291 | NaN | 0 | 1 | 0 | Core staff | 2.0 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | School | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | reg oper account | block of flats | Block | No | 0.0 | 1.0 | 0.0 | -828.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | -1400.75 | 0.0 | -544.5 | -1097.333333 | 0.000 | 0.0 | 254350.125000 | 0.0 | 202500.00000 | 0.0 | -816.000 | NaN | NaN | 3.0 |
| 2 | 100004 | 0.0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -4260.0 | -2531 | 26.0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | -867.00 | 0.0 | -488.5 | -532.500000 | 0.000 | 0.0 | 94518.900000 | 0.0 | 0.00000 | 0.0 | -532.000 | NaN | NaN | 1.0 |
| 3 | 100006 | 0.0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -9833.0 | -2437 | NaN | 0 | 0 | 0 | Laborers | 2.0 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 2.0 | 0.0 | -617.0 | 1 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9.0 |
| 4 | 100007 | 0.0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -4311.0 | -3458 | NaN | 0 | 0 | 0 | Core staff | 1.0 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1149.00 | 0.0 | -783.0 | -783.000000 | 0.000 | 0.0 | 146250.000000 | 0.0 | 0.00000 | 0.0 | -783.000 | NaN | NaN | 6.0 |
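After the left merge, PREVIOUS_APPLICATION_COUNT is displayed as a float (1.0, 3.0, ...). That happens because clients without any previous application receive NaN, which forces the integer counts into a float column. Whether to keep NaN or replace it with 0 is a modelling choice; the sketch below, on made-up toy data, shows the second option and is not what this notebook does.

```python
import pandas as pd

clients = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})            # client 2 never applied before
prev_toy = pd.DataFrame({"SK_ID_CURR": [1, 1, 3],
                         "SK_ID_PREV": [10, 11, 12]})

counts = (
    prev_toy.groupby("SK_ID_CURR", as_index=False)["SK_ID_PREV"]
    .count()
    .rename(columns={"SK_ID_PREV": "PREVIOUS_APPLICATION_COUNT"})
)
clients = clients.merge(counts, on="SK_ID_CURR", how="left")

# Optional: treat "no previous application" as a count of zero
clients["PREVIOUS_APPLICATION_COUNT"] = (
    clients["PREVIOUS_APPLICATION_COUNT"].fillna(0).astype(int)
)

print(clients)
```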
Now back to our process
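def extract_mean(x):
    # numeric_only=True keeps only numeric columns, so any text columns are left out of the mean
    y = x.groupby('SK_ID_PREV', as_index=False).mean(numeric_only=True).add_prefix('CARD_MEAN_')
    return y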
credit_card_balance_mean = extract_mean(credit_card_balance)
credit_card_balance_mean = credit_card_balance_mean.rename(columns = {'CARD_MEAN_SK_ID_PREV' : 'SK_ID_PREV'})
credit_card_balance_mean.head(10)
| | SK_ID_PREV | CARD_MEAN_MONTHS_BALANCE | CARD_MEAN_AMT_CREDIT_LIMIT_ACTUAL | CARD_MEAN_AMT_DRAWINGS_CURRENT | CARD_MEAN_AMT_INST_MIN_REGULARITY | CARD_MEAN_AMT_PAYMENT_TOTAL_CURRENT | CARD_MEAN_AMT_TOTAL_RECEIVABLE | CARD_MEAN_CNT_DRAWINGS_CURRENT | CARD_MEAN_CNT_INSTALMENT_MATURE_CUM | CARD_MEAN_SK_DPD | CARD_MEAN_SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000018 | -4.0 | 81000.000000 | 29478.996000 | 2594.088000 | 5541.750000 | 73602.585000 | 8.800000 | 2.000000 | 0.000000 | 0.000000 |
| 1 | 1000030 | -4.5 | 81562.500000 | 17257.438125 | 2078.223750 | 2657.947500 | 55935.376875 | 5.125000 | 1.875000 | 0.000000 | 0.000000 |
| 2 | 1000031 | -8.5 | 149625.000000 | 28959.615000 | 2675.300625 | 22157.443125 | 52099.970625 | 1.312500 | 3.687500 | 0.000000 | 0.000000 |
| 3 | 1000035 | -4.0 | 225000.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 4 | 1000077 | -7.0 | 94090.909091 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 5 | 1000083 | -7.0 | 183461.538462 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 6 | 1000087 | -16.5 | 71718.750000 | 4278.474844 | 2242.807500 | 6165.232031 | 39085.408125 | 0.406250 | 9.166667 | 0.000000 | 0.000000 |
| 7 | 1000089 | -3.0 | 135000.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 8 | 1000094 | -45.5 | 51392.045455 | 1575.011761 | 1993.890176 | 2915.807216 | 29386.505455 | 0.090909 | 33.670588 | 0.011364 | 0.011364 |
| 9 | 1000096 | -48.5 | 180000.000000 | 4334.961094 | 2743.612031 | 7619.022656 | 38567.323594 | 0.260417 | 34.562500 | 0.354167 | 0.000000 |
previous_application = previous_application.merge(credit_card_balance_mean, on = 'SK_ID_PREV', how = 'left')
previous_application.head()
| | SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_CREDIT | AMT_DOWN_PAYMENT | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | NFLAG_INSURED_ON_APPROVAL | CARD_MEAN_MONTHS_BALANCE | CARD_MEAN_AMT_CREDIT_LIMIT_ACTUAL | CARD_MEAN_AMT_DRAWINGS_CURRENT | CARD_MEAN_AMT_INST_MIN_REGULARITY | CARD_MEAN_AMT_PAYMENT_TOTAL_CURRENT | CARD_MEAN_AMT_TOTAL_RECEIVABLE | CARD_MEAN_CNT_DRAWINGS_CURRENT | CARD_MEAN_CNT_INSTALMENT_MATURE_CUM | CARD_MEAN_SK_DPD | CARD_MEAN_SK_DPD_DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 0.0 | SATURDAY | 15 | 1 | 0.0 | Approved | -73 | Cash through the bank | NaN | Repeater | Mobile | POS | XNA | Country-wide | 35 | Connectivity | 12.0 | middle | POS mobile with interest | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 679671.0 | NaN | THURSDAY | 11 | 1 | NaN | Approved | -164 | XNA | Unaccompanied | Repeater | XNA | Cash | x-sell | Contact center | -1 | XNA | 36.0 | low_action | Cash X-Sell: low | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 136444.5 | NaN | TUESDAY | 11 | 1 | NaN | Approved | -301 | Cash through the bank | Spouse, partner | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | high | Cash X-Sell: high | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 470790.0 | NaN | MONDAY | 7 | 1 | NaN | Approved | -512 | Cash through the bank | NaN | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | middle | Cash X-Sell: middle | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 404055.0 | NaN | THURSDAY | 9 | 1 | NaN | Refused | -781 | Cash through the bank | NaN | Repeater | XNA | Cash | walk-in | Credit and cash offices | -1 | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# Double check
previous_application[previous_application["SK_ID_PREV"] == 1000018]
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_CREDIT | AMT_DOWN_PAYMENT | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | NFLAG_INSURED_ON_APPROVAL | CARD_MEAN_MONTHS_BALANCE | CARD_MEAN_AMT_CREDIT_LIMIT_ACTUAL | CARD_MEAN_AMT_DRAWINGS_CURRENT | CARD_MEAN_AMT_INST_MIN_REGULARITY | CARD_MEAN_AMT_PAYMENT_TOTAL_CURRENT | CARD_MEAN_AMT_TOTAL_RECEIVABLE | CARD_MEAN_CNT_DRAWINGS_CURRENT | CARD_MEAN_CNT_INSTALMENT_MATURE_CUM | CARD_MEAN_SK_DPD | CARD_MEAN_SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1519242 | 1000018 | 394447 | Revolving loans | 2250.0 | 45000.0 | NaN | SUNDAY | 10 | 1 | NaN | Approved | -176 | XNA | Unaccompanied | New | XNA | Cards | walk-in | Country-wide | 595 | Consumer electronics | 0.0 | XNA | Card Street | 0.0 | -4.0 | 81000.0 | 29478.996 | 2594.088 | 5541.75 | 73602.585 | 8.8 | 2.0 | 0.0 | 0.0 |
def extract_mean(x):
    # Average every column per previous application and prefix the resulting names
    y = x.groupby('SK_ID_PREV', as_index=False).mean().add_prefix('INSTALL_MEAN_')
    return y
install_pay_mean = extract_mean(installments_payments)
install_pay_mean = install_pay_mean.rename(columns = {'INSTALL_MEAN_SK_ID_PREV' : 'SK_ID_PREV'})
install_pay_mean.head()
| | SK_ID_PREV | INSTALL_MEAN_NUM_INSTALMENT_VERSION | INSTALL_MEAN_NUM_INSTALMENT_NUMBER | INSTALL_MEAN_DAYS_INSTALMENT | INSTALL_MEAN_DAYS_ENTRY_PAYMENT | INSTALL_MEAN_AMT_INSTALMENT | INSTALL_MEAN_AMT_PAYMENT |
|---|---|---|---|---|---|---|---|
| 0 | 1000001 | 1.500000 | 1.500000 | -253.000000 | -269.000000 | 34221.712500 | 34221.712500 |
| 1 | 1000002 | 1.250000 | 2.500000 | -1555.000000 | -1574.750000 | 9308.891250 | 9308.891250 |
| 2 | 1000003 | 1.000000 | 2.000000 | -64.000000 | -79.333333 | 4951.350000 | 4951.350000 |
| 3 | 1000004 | 1.142857 | 4.000000 | -772.000000 | -798.714286 | 4789.022143 | 4789.022143 |
| 4 | 1000005 | 1.000000 | 5.818182 | -1543.454545 | -1551.909091 | 14703.210000 | 13365.609545 |
previous_application = previous_application.merge(install_pay_mean, on = 'SK_ID_PREV', how = 'left')
previous_application.head(10)
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_CREDIT | AMT_DOWN_PAYMENT | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | NFLAG_INSURED_ON_APPROVAL | CARD_MEAN_MONTHS_BALANCE | CARD_MEAN_AMT_CREDIT_LIMIT_ACTUAL | CARD_MEAN_AMT_DRAWINGS_CURRENT | CARD_MEAN_AMT_INST_MIN_REGULARITY | CARD_MEAN_AMT_PAYMENT_TOTAL_CURRENT | CARD_MEAN_AMT_TOTAL_RECEIVABLE | CARD_MEAN_CNT_DRAWINGS_CURRENT | CARD_MEAN_CNT_INSTALMENT_MATURE_CUM | CARD_MEAN_SK_DPD | CARD_MEAN_SK_DPD_DEF | INSTALL_MEAN_NUM_INSTALMENT_VERSION | INSTALL_MEAN_NUM_INSTALMENT_NUMBER | INSTALL_MEAN_DAYS_INSTALMENT | INSTALL_MEAN_DAYS_ENTRY_PAYMENT | INSTALL_MEAN_AMT_INSTALMENT | INSTALL_MEAN_AMT_PAYMENT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 0.0 | SATURDAY | 15 | 1 | 0.0 | Approved | -73 | Cash through the bank | NaN | Repeater | Mobile | POS | XNA | Country-wide | 35 | Connectivity | 12.0 | middle | POS mobile with interest | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.000000 | 1.0 | -42.0 | -42.000000 | 17284.275000 | 17284.275000 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 679671.0 | NaN | THURSDAY | 11 | 1 | NaN | Approved | -164 | XNA | Unaccompanied | Repeater | XNA | Cash | x-sell | Contact center | -1 | XNA | 36.0 | low_action | Cash X-Sell: low | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | 3.0 | -74.0 | -83.200000 | 25188.615000 | 25188.615000 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 136444.5 | NaN | TUESDAY | 11 | 1 | NaN | Approved | -301 | Cash through the bank | Spouse, partner | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | high | Cash X-Sell: high | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | 5.0 | -151.0 | -159.222222 | 15060.735000 | 15060.735000 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 470790.0 | NaN | MONDAY | 7 | 1 | NaN | Approved | -512 | Cash through the bank | NaN | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | middle | Cash X-Sell: middle | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.090909 | 6.0 | -332.0 | -339.090909 | 51193.943182 | 51193.943182 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 404055.0 | NaN | THURSDAY | 9 | 1 | NaN | Refused | -781 | Cash through the bank | NaN | Repeater | XNA | Cash | walk-in | Credit and cash offices | -1 | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 1383531 | 199383 | Cash loans | 23703.930 | 340573.5 | NaN | SATURDAY | 8 | 1 | NaN | Approved | -684 | Cash through the bank | Family | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 18.0 | low_normal | Cash X-Sell: low | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.700000 | 9.6 | -396.0 | -405.100000 | 20966.645250 | 26522.376750 |
| 6 | 2315218 | 175704 | Cash loans | NaN | 0.0 | NaN | TUESDAY | 11 | 1 | NaN | Canceled | -14 | XNA | NaN | Repeater | XNA | XNA | XNA | Credit and cash offices | -1 | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | 1656711 | 296299 | Cash loans | NaN | 0.0 | NaN | MONDAY | 7 | 1 | NaN | Canceled | -21 | XNA | NaN | Repeater | XNA | XNA | XNA | Credit and cash offices | -1 | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | 2367563 | 342292 | Cash loans | NaN | 0.0 | NaN | MONDAY | 15 | 1 | NaN | Canceled | -386 | XNA | NaN | Repeater | XNA | XNA | XNA | Credit and cash offices | -1 | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | 2579447 | 334349 | Cash loans | NaN | 0.0 | NaN | SATURDAY | 15 | 1 | NaN | Canceled | -57 | XNA | NaN | Repeater | XNA | XNA | XNA | Credit and cash offices | -1 | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
def extract_mean(x):
    # Same aggregation as before, now with the POS_MEAN_ prefix
    y = x.groupby('SK_ID_PREV', as_index=False).mean().add_prefix('POS_MEAN_')
    return y
POS_mean = extract_mean(POS_CASH_balance)
POS_mean = POS_mean.rename(columns = {'POS_MEAN_SK_ID_PREV' : 'SK_ID_PREV'})
POS_mean.head()
| | SK_ID_PREV | POS_MEAN_MONTHS_BALANCE | POS_MEAN_CNT_INSTALMENT | POS_MEAN_CNT_INSTALMENT_FUTURE |
|---|---|---|---|---|
| 0 | 1000001 | -9.0 | 8.666667 | 7.666667 |
| 1 | 1000002 | -52.0 | 5.200000 | 2.000000 |
| 2 | 1000003 | -2.5 | 12.000000 | 10.500000 |
| 3 | 1000004 | -25.5 | 9.625000 | 6.125000 |
| 4 | 1000005 | -51.0 | 10.000000 | 5.000000 |
previous_application = previous_application.merge(POS_mean, on = 'SK_ID_PREV', how = 'left')
previous_application.head()
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_CREDIT | AMT_DOWN_PAYMENT | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | NFLAG_INSURED_ON_APPROVAL | CARD_MEAN_MONTHS_BALANCE | CARD_MEAN_AMT_CREDIT_LIMIT_ACTUAL | CARD_MEAN_AMT_DRAWINGS_CURRENT | CARD_MEAN_AMT_INST_MIN_REGULARITY | CARD_MEAN_AMT_PAYMENT_TOTAL_CURRENT | CARD_MEAN_AMT_TOTAL_RECEIVABLE | CARD_MEAN_CNT_DRAWINGS_CURRENT | CARD_MEAN_CNT_INSTALMENT_MATURE_CUM | CARD_MEAN_SK_DPD | CARD_MEAN_SK_DPD_DEF | INSTALL_MEAN_NUM_INSTALMENT_VERSION | INSTALL_MEAN_NUM_INSTALMENT_NUMBER | INSTALL_MEAN_DAYS_INSTALMENT | INSTALL_MEAN_DAYS_ENTRY_PAYMENT | INSTALL_MEAN_AMT_INSTALMENT | INSTALL_MEAN_AMT_PAYMENT | POS_MEAN_MONTHS_BALANCE | POS_MEAN_CNT_INSTALMENT | POS_MEAN_CNT_INSTALMENT_FUTURE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 0.0 | SATURDAY | 15 | 1 | 0.0 | Approved | -73 | Cash through the bank | NaN | Repeater | Mobile | POS | XNA | Country-wide | 35 | Connectivity | 12.0 | middle | POS mobile with interest | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.000000 | 1.0 | -42.0 | -42.000000 | 17284.275000 | 17284.275000 | -1.5 | 6.500000 | 6.000000 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 679671.0 | NaN | THURSDAY | 11 | 1 | NaN | Approved | -164 | XNA | Unaccompanied | Repeater | XNA | Cash | x-sell | Contact center | -1 | XNA | 36.0 | low_action | Cash X-Sell: low | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | 3.0 | -74.0 | -83.200000 | 25188.615000 | 25188.615000 | -4.0 | 36.000000 | 34.000000 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 136444.5 | NaN | TUESDAY | 11 | 1 | NaN | Approved | -301 | Cash through the bank | Spouse, partner | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | high | Cash X-Sell: high | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | 5.0 | -151.0 | -159.222222 | 15060.735000 | 15060.735000 | -5.5 | 12.000000 | 7.500000 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 470790.0 | NaN | MONDAY | 7 | 1 | NaN | Approved | -512 | Cash through the bank | NaN | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | middle | Cash X-Sell: middle | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.090909 | 6.0 | -332.0 | -339.090909 | 51193.943182 | 51193.943182 | -11.5 | 11.916667 | 6.416667 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 404055.0 | NaN | THURSDAY | 9 | 1 | NaN | Refused | -781 | Cash through the bank | NaN | Repeater | XNA | Cash | walk-in | Credit and cash offices | -1 | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
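def extract_mean(x):
    # numeric_only=True skips text columns such as NAME_CONTRACT_TYPE, which cannot be averaged
    y = x.groupby('SK_ID_CURR', as_index=False).mean(numeric_only=True).add_prefix('PREV_APPL_MEAN_')
    return y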
prev_appl_mean = extract_mean(previous_application)
prev_appl_mean = prev_appl_mean.rename(columns = {'PREV_APPL_MEAN_SK_ID_CURR' : 'SK_ID_CURR'})
prev_appl_mean.head()
| SK_ID_CURR | PREV_APPL_MEAN_SK_ID_PREV | PREV_APPL_MEAN_AMT_ANNUITY | PREV_APPL_MEAN_AMT_CREDIT | PREV_APPL_MEAN_AMT_DOWN_PAYMENT | PREV_APPL_MEAN_HOUR_APPR_PROCESS_START | PREV_APPL_MEAN_NFLAG_LAST_APPL_IN_DAY | PREV_APPL_MEAN_RATE_DOWN_PAYMENT | PREV_APPL_MEAN_DAYS_DECISION | PREV_APPL_MEAN_SELLERPLACE_AREA | PREV_APPL_MEAN_CNT_PAYMENT | PREV_APPL_MEAN_NFLAG_INSURED_ON_APPROVAL | PREV_APPL_MEAN_CARD_MEAN_MONTHS_BALANCE | PREV_APPL_MEAN_CARD_MEAN_AMT_CREDIT_LIMIT_ACTUAL | PREV_APPL_MEAN_CARD_MEAN_AMT_DRAWINGS_CURRENT | PREV_APPL_MEAN_CARD_MEAN_AMT_INST_MIN_REGULARITY | PREV_APPL_MEAN_CARD_MEAN_AMT_PAYMENT_TOTAL_CURRENT | PREV_APPL_MEAN_CARD_MEAN_AMT_TOTAL_RECEIVABLE | PREV_APPL_MEAN_CARD_MEAN_CNT_DRAWINGS_CURRENT | PREV_APPL_MEAN_CARD_MEAN_CNT_INSTALMENT_MATURE_CUM | PREV_APPL_MEAN_CARD_MEAN_SK_DPD | PREV_APPL_MEAN_CARD_MEAN_SK_DPD_DEF | PREV_APPL_MEAN_INSTALL_MEAN_NUM_INSTALMENT_VERSION | PREV_APPL_MEAN_INSTALL_MEAN_NUM_INSTALMENT_NUMBER | PREV_APPL_MEAN_INSTALL_MEAN_DAYS_INSTALMENT | PREV_APPL_MEAN_INSTALL_MEAN_DAYS_ENTRY_PAYMENT | PREV_APPL_MEAN_INSTALL_MEAN_AMT_INSTALMENT | PREV_APPL_MEAN_INSTALL_MEAN_AMT_PAYMENT | PREV_APPL_MEAN_POS_MEAN_MONTHS_BALANCE | PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT | PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT_FUTURE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | 1.369693e+06 | 3951.000 | 23787.00 | 2520.0 | 13.000000 | 1.0 | 0.104326 | -1740.0 | 23.0 | 8.0 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.250000 | 2.500000 | -1664.000000 | -1679.500000 | 7312.725000 | 7312.725000 | -55.000000 | 4.000000 | 2.000000 |
| 1 | 100002 | 1.038818e+06 | 9251.775 | 179055.00 | 0.0 | 9.000000 | 1.0 | 0.000000 | -606.0 | 500.0 | 24.0 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.052632 | 10.000000 | -295.000000 | -315.421053 | 11559.247105 | 11559.247105 | -10.000000 | 24.000000 | 15.000000 |
| 2 | 100003 | 2.281150e+06 | 56553.990 | 484191.00 | 3442.5 | 14.666667 | 1.0 | 0.050030 | -1305.0 | 533.0 | 10.0 | 0.666667 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.047619 | 4.666667 | -1164.333333 | -1171.781746 | 78558.479286 | 78558.479286 | -39.166667 | 9.791667 | 5.666667 |
| 3 | 100004 | 1.564014e+06 | 5357.250 | 20106.00 | 4860.0 | 5.000000 | 1.0 | 0.212008 | -815.0 | 30.0 | 4.0 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.333333 | 2.000000 | -754.000000 | -761.666667 | 7096.155000 | 7096.155000 | -25.500000 | 3.750000 | 2.250000 |
| 4 | 100005 | 2.176837e+06 | 4813.200 | 20076.75 | 4464.0 | 10.500000 | 1.0 | 0.108964 | -536.0 | 18.0 | 12.0 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.111111 | 5.000000 | -586.000000 | -609.555556 | 6240.205000 | 6240.205000 | -20.000000 | 11.700000 | 7.200000 |
prev_appl_mean = prev_appl_mean.drop('PREV_APPL_MEAN_SK_ID_PREV', axis = 1) # we don't need this intermediate column any more
prev_appl_mean.head()
| SK_ID_CURR | PREV_APPL_MEAN_AMT_ANNUITY | PREV_APPL_MEAN_AMT_CREDIT | PREV_APPL_MEAN_AMT_DOWN_PAYMENT | PREV_APPL_MEAN_HOUR_APPR_PROCESS_START | PREV_APPL_MEAN_NFLAG_LAST_APPL_IN_DAY | PREV_APPL_MEAN_RATE_DOWN_PAYMENT | PREV_APPL_MEAN_DAYS_DECISION | PREV_APPL_MEAN_SELLERPLACE_AREA | PREV_APPL_MEAN_CNT_PAYMENT | PREV_APPL_MEAN_NFLAG_INSURED_ON_APPROVAL | PREV_APPL_MEAN_CARD_MEAN_MONTHS_BALANCE | PREV_APPL_MEAN_CARD_MEAN_AMT_CREDIT_LIMIT_ACTUAL | PREV_APPL_MEAN_CARD_MEAN_AMT_DRAWINGS_CURRENT | PREV_APPL_MEAN_CARD_MEAN_AMT_INST_MIN_REGULARITY | PREV_APPL_MEAN_CARD_MEAN_AMT_PAYMENT_TOTAL_CURRENT | PREV_APPL_MEAN_CARD_MEAN_AMT_TOTAL_RECEIVABLE | PREV_APPL_MEAN_CARD_MEAN_CNT_DRAWINGS_CURRENT | PREV_APPL_MEAN_CARD_MEAN_CNT_INSTALMENT_MATURE_CUM | PREV_APPL_MEAN_CARD_MEAN_SK_DPD | PREV_APPL_MEAN_CARD_MEAN_SK_DPD_DEF | PREV_APPL_MEAN_INSTALL_MEAN_NUM_INSTALMENT_VERSION | PREV_APPL_MEAN_INSTALL_MEAN_NUM_INSTALMENT_NUMBER | PREV_APPL_MEAN_INSTALL_MEAN_DAYS_INSTALMENT | PREV_APPL_MEAN_INSTALL_MEAN_DAYS_ENTRY_PAYMENT | PREV_APPL_MEAN_INSTALL_MEAN_AMT_INSTALMENT | PREV_APPL_MEAN_INSTALL_MEAN_AMT_PAYMENT | PREV_APPL_MEAN_POS_MEAN_MONTHS_BALANCE | PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT | PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT_FUTURE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | 3951.000 | 23787.00 | 2520.0 | 13.000000 | 1.0 | 0.104326 | -1740.0 | 23.0 | 8.0 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.250000 | 2.500000 | -1664.000000 | -1679.500000 | 7312.725000 | 7312.725000 | -55.000000 | 4.000000 | 2.000000 |
| 1 | 100002 | 9251.775 | 179055.00 | 0.0 | 9.000000 | 1.0 | 0.000000 | -606.0 | 500.0 | 24.0 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.052632 | 10.000000 | -295.000000 | -315.421053 | 11559.247105 | 11559.247105 | -10.000000 | 24.000000 | 15.000000 |
| 2 | 100003 | 56553.990 | 484191.00 | 3442.5 | 14.666667 | 1.0 | 0.050030 | -1305.0 | 533.0 | 10.0 | 0.666667 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.047619 | 4.666667 | -1164.333333 | -1171.781746 | 78558.479286 | 78558.479286 | -39.166667 | 9.791667 | 5.666667 |
| 3 | 100004 | 5357.250 | 20106.00 | 4860.0 | 5.000000 | 1.0 | 0.212008 | -815.0 | 30.0 | 4.0 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.333333 | 2.000000 | -754.000000 | -761.666667 | 7096.155000 | 7096.155000 | -25.500000 | 3.750000 | 2.250000 |
| 4 | 100005 | 4813.200 | 20076.75 | 4464.0 | 10.500000 | 1.0 | 0.108964 | -536.0 | 18.0 | 12.0 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.111111 | 5.000000 | -586.000000 | -609.555556 | 6240.205000 | 6240.205000 | -20.000000 | 11.700000 | 7.200000 |
print('data shape', data.shape)
print('previous applications statistics shape', prev_appl_mean.shape)
data shape (356255, 83)
previous applications statistics shape (338857, 30)
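The two shapes above differ in row count (356255 against 338857), so a left merge necessarily leaves at least 17398 applicants with NaN in every `PREV_APPL_MEAN_` column. A minimal sketch of how that could be counted with the `indicator` option of `merge`, using made-up toy frames (`toy_left`, `toy_stats`) rather than the real tables:

```python
import pandas as pd

# Toy stand-ins (hypothetical values): four applicants, statistics for two of them
toy_left = pd.DataFrame({"SK_ID_CURR": [1, 2, 3, 4]})
toy_stats = pd.DataFrame({
    "SK_ID_CURR": [1, 3],
    "PREV_APPL_MEAN_AMT_CREDIT": [100.0, 250.0],
})

# indicator=True adds a "_merge" column saying where each row found a match
checked = toy_left.merge(toy_stats, on="SK_ID_CURR", how="left", indicator=True)

# "left_only" rows are applicants without any previous-application statistics
n_unmatched = int((checked["_merge"] == "left_only").sum())
print("applicants without statistics:", n_unmatched)
```

The `_merge` column is only a diagnostic and would be dropped before any further processing.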
data = data.merge(prev_appl_mean, on = 'SK_ID_CURR', how = 'left')
data.head()
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_8 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | PREVIOUS_LOANS_COUNT | PREV_BUR_MEAN_DAYS_CREDIT | PREV_BUR_MEAN_CREDIT_DAY_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE | PREV_BUR_MEAN_DAYS_ENDDATE_FACT | PREV_BUR_MEAN_AMT_CREDIT_MAX_OVERDUE | PREV_BUR_MEAN_CNT_CREDIT_PROLONG | PREV_BUR_MEAN_AMT_CREDIT_SUM | PREV_BUR_MEAN_AMT_CREDIT_SUM_DEBT | PREV_BUR_MEAN_AMT_CREDIT_SUM_LIMIT | PREV_BUR_MEAN_AMT_CREDIT_SUM_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_UPDATE | PREV_BUR_MEAN_AMT_ANNUITY | PREV_BUR_MEAN_BUR_BAL_MEAN_MONTHS_BALANCE | PREVIOUS_APPLICATION_COUNT | PREV_APPL_MEAN_AMT_ANNUITY | PREV_APPL_MEAN_AMT_CREDIT | PREV_APPL_MEAN_AMT_DOWN_PAYMENT | PREV_APPL_MEAN_HOUR_APPR_PROCESS_START | PREV_APPL_MEAN_NFLAG_LAST_APPL_IN_DAY | PREV_APPL_MEAN_RATE_DOWN_PAYMENT | PREV_APPL_MEAN_DAYS_DECISION | PREV_APPL_MEAN_SELLERPLACE_AREA | PREV_APPL_MEAN_CNT_PAYMENT | PREV_APPL_MEAN_NFLAG_INSURED_ON_APPROVAL | PREV_APPL_MEAN_CARD_MEAN_MONTHS_BALANCE | PREV_APPL_MEAN_CARD_MEAN_AMT_CREDIT_LIMIT_ACTUAL | PREV_APPL_MEAN_CARD_MEAN_AMT_DRAWINGS_CURRENT | PREV_APPL_MEAN_CARD_MEAN_AMT_INST_MIN_REGULARITY | PREV_APPL_MEAN_CARD_MEAN_AMT_PAYMENT_TOTAL_CURRENT | PREV_APPL_MEAN_CARD_MEAN_AMT_TOTAL_RECEIVABLE | PREV_APPL_MEAN_CARD_MEAN_CNT_DRAWINGS_CURRENT | PREV_APPL_MEAN_CARD_MEAN_CNT_INSTALMENT_MATURE_CUM | PREV_APPL_MEAN_CARD_MEAN_SK_DPD | PREV_APPL_MEAN_CARD_MEAN_SK_DPD_DEF | PREV_APPL_MEAN_INSTALL_MEAN_NUM_INSTALMENT_VERSION | PREV_APPL_MEAN_INSTALL_MEAN_NUM_INSTALMENT_NUMBER | PREV_APPL_MEAN_INSTALL_MEAN_DAYS_INSTALMENT | PREV_APPL_MEAN_INSTALL_MEAN_DAYS_ENTRY_PAYMENT | PREV_APPL_MEAN_INSTALL_MEAN_AMT_INSTALMENT | PREV_APPL_MEAN_INSTALL_MEAN_AMT_PAYMENT | PREV_APPL_MEAN_POS_MEAN_MONTHS_BALANCE | PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT | PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT_FUTURE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1.0 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -3648.0 | -2120 | NaN | 0 | 1 | 0 | Laborers | 1.0 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | reg oper account | block of flats | Stone, brick | No | 2.0 | 2.0 | 2.0 | -1134.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 | -874.00 | 0.0 | -349.0 | -697.500000 | 1681.029 | 0.0 | 108131.945625 | 49156.2 | 7997.14125 | 0.0 | -499.875 | 0.0 | -21.875 | 1.0 | 9251.775 | 179055.00 | 0.00 | 9.000000 | 1.0 | 0.000000 | -606.000000 | 500.000000 | 24.000000 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.052632 | 10.000000 | -295.000000 | -315.421053 | 11559.247105 | 11559.247105 | -10.000000 | 24.000000 | 15.000000 |
| 1 | 100003 | 0.0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1186.0 | -291 | NaN | 0 | 1 | 0 | Core staff | 2.0 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | School | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | reg oper account | block of flats | Block | No | 0.0 | 1.0 | 0.0 | -828.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | -1400.75 | 0.0 | -544.5 | -1097.333333 | 0.000 | 0.0 | 254350.125000 | 0.0 | 202500.00000 | 0.0 | -816.000 | NaN | NaN | 3.0 | 56553.990 | 484191.00 | 3442.50 | 14.666667 | 1.0 | 0.050030 | -1305.000000 | 533.000000 | 10.000000 | 0.666667 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.047619 | 4.666667 | -1164.333333 | -1171.781746 | 78558.479286 | 78558.479286 | -39.166667 | 9.791667 | 5.666667 |
| 2 | 100004 | 0.0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -4260.0 | -2531 | 26.0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | -867.00 | 0.0 | -488.5 | -532.500000 | 0.000 | 0.0 | 94518.900000 | 0.0 | 0.00000 | 0.0 | -532.000 | NaN | NaN | 1.0 | 5357.250 | 20106.00 | 4860.00 | 5.000000 | 1.0 | 0.212008 | -815.000000 | 30.000000 | 4.000000 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.333333 | 2.000000 | -754.000000 | -761.666667 | 7096.155000 | 7096.155000 | -25.500000 | 3.750000 | 2.250000 |
| 3 | 100006 | 0.0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -9833.0 | -2437 | NaN | 0 | 0 | 0 | Laborers | 2.0 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 2.0 | 0.0 | -617.0 | 1 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9.0 | 23651.175 | 291695.50 | 34840.17 | 14.666667 | 1.0 | 0.163412 | -272.444444 | 894.222222 | 23.000000 | 0.000000 | -3.5 | 270000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.400000 | 3.166667 | -260.666667 | -285.966667 | 241944.225000 | 241944.225000 | -9.000000 | 12.888889 | 10.214286 |
| 4 | 100007 | 0.0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -4311.0 | -3458 | NaN | 0 | 0 | 0 | Core staff | 1.0 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1149.00 | 0.0 | -783.0 | -783.000000 | 0.000 | 0.0 | 146250.000000 | 0.0 | 0.00000 | 0.0 | -783.000 | NaN | NaN | 6.0 | 12278.805 | 166638.75 | 3390.75 | 12.333333 | 1.0 | 0.159516 | -1222.833333 | 409.166667 | 20.666667 | 0.600000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.129412 | 6.843956 | -1087.881319 | -1090.768539 | 12122.995738 | 11671.540210 | -36.100000 | 15.066667 | 8.966667 |
print('data shape', data.shape)
data shape (356255, 112)
As we can see, this last pass over the previous applications added 29 new features to our statistics (from 83 to 112 columns) and completed the unification of all the data.
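The aggregation steps above redefine `extract_mean` each time with a different key and prefix. A minimal sketch of a single reusable helper, assuming the hypothetical name `aggregate_mean` and a made-up toy frame in place of `installments_payments`:

```python
import pandas as pd

def aggregate_mean(df, key, prefix):
    """Average every numeric column per `key` and prefix the new column names.

    Hypothetical helper, not part of the original notebook.
    """
    # The key becomes the index, so add_prefix only touches the value columns
    means = df.groupby(key).mean(numeric_only=True).add_prefix(prefix)
    # Bring the key back as an ordinary column, ready for merging
    return means.reset_index()

# Toy stand-in for installments_payments (values are made up)
toy_installments = pd.DataFrame({
    "SK_ID_PREV": [1, 1, 2],
    "AMT_PAYMENT": [100.0, 300.0, 50.0],
    "STATUS": ["a", "b", "c"],  # non-numeric, skipped by numeric_only=True
})

toy_means = aggregate_mean(toy_installments, "SK_ID_PREV", "INSTALL_MEAN_")
print(toy_means)
```

Because the key is kept out of the prefixed columns, the separate `rename` step used after each `extract_mean` call would no longer be needed with this variant.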
All the datasets are combined as follows:
First, application_train and application_test are concatenated into a single dataframe named data.
Next, the remaining tables are combined along two branches: the left branch and the right branch.
In either branch, the general idea can be summed up in one sentence: 'To plug table B into table A using key K (suggested on the graph), we must first make key K unique in table B.'
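The unique-key rule can be illustrated on toy data. The sketch below uses hypothetical frames `table_a` and `table_b` with a key column `K` (none of these names come from the dataset) and shows that merging a table whose key repeats duplicates rows, while aggregating first keeps one row per key:

```python
import pandas as pd

# Table A: one row per key. Table B: key 1 appears twice (values are made up)
table_a = pd.DataFrame({"K": [1, 2, 3], "a_value": [10, 20, 30]})
table_b = pd.DataFrame({"K": [1, 1, 2], "b_value": [1.0, 3.0, 5.0]})

# Merging B as-is duplicates the row of A whose key repeats in B
naive = table_a.merge(table_b, on="K", how="left")
print(len(table_a), len(naive))

# Aggregating B first makes K unique, so A keeps exactly one row per key
table_b_unique = table_b.groupby("K", as_index=False).mean()
joined = table_a.merge(table_b_unique, on="K", how="left", validate="one_to_one")
print(joined)
```

The `validate="one_to_one"` argument makes pandas raise an error if the key turns out not to be unique on either side, which is a cheap safeguard for this kind of join.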
Left branch:
Right branch:
Note 1:
Note 2: The prefix in each column name tells us which table the column came from:
Perform split according to IDs in initial train and test datasets
train = data[data['SK_ID_CURR'].isin(application_train.SK_ID_CURR)]
test = data[data.SK_ID_CURR.isin(application_test.SK_ID_CURR)]
test = test.drop('TARGET', axis = 1) # reassign instead of dropping in place, because test is a slice of data
print("Initial train set", application_train.shape)
print("Initial test set", application_test.shape)
print('Training Features shape with categorical columns: ', train.shape)
print('Testing Features shape with categorical columns: ', test.shape)
Initial train set (307511, 68)
Initial test set (48744, 67)
Training Features shape with categorical columns: (307511, 112)
Testing Features shape with categorical columns: (48744, 111)
The row counts match, so the split looks correct, but let us double-check to be sure: we look up IDs 100001 and 100005 in the test set, and the train set should now contain no null TARGET values.
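Besides looking up individual IDs, the same split checks could be expressed as assertions. This is only a sketch on made-up toy frames (`toy_data`, `toy_train_ids`, `toy_test_ids` are hypothetical), mirroring the `isin` split used above:

```python
import pandas as pd

# Toy combined frame: test rows carry no TARGET (values are made up)
toy_data = pd.DataFrame({
    "SK_ID_CURR": [100001, 100002, 100003, 100005],
    "TARGET": [None, 1.0, 0.0, None],
})
toy_train_ids = pd.Series([100002, 100003])
toy_test_ids = pd.Series([100001, 100005])

# Same split logic as in the notebook, applied to the toy frame
toy_train = toy_data[toy_data["SK_ID_CURR"].isin(toy_train_ids)]
toy_test = toy_data[toy_data["SK_ID_CURR"].isin(toy_test_ids)].drop(columns="TARGET")

# Every training row must have a label
assert toy_train["TARGET"].notna().all()
# The two parts must cover the combined frame without overlapping
assert len(toy_train) + len(toy_test) == len(toy_data)
assert set(toy_train["SK_ID_CURR"]).isdisjoint(toy_test["SK_ID_CURR"])
```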
test[test["SK_ID_CURR"] == 100001]
| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_8 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | PREVIOUS_LOANS_COUNT | PREV_BUR_MEAN_DAYS_CREDIT | PREV_BUR_MEAN_CREDIT_DAY_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE | PREV_BUR_MEAN_DAYS_ENDDATE_FACT | PREV_BUR_MEAN_AMT_CREDIT_MAX_OVERDUE | PREV_BUR_MEAN_CNT_CREDIT_PROLONG | PREV_BUR_MEAN_AMT_CREDIT_SUM | PREV_BUR_MEAN_AMT_CREDIT_SUM_DEBT | PREV_BUR_MEAN_AMT_CREDIT_SUM_LIMIT | PREV_BUR_MEAN_AMT_CREDIT_SUM_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_UPDATE | PREV_BUR_MEAN_AMT_ANNUITY | PREV_BUR_MEAN_BUR_BAL_MEAN_MONTHS_BALANCE | PREVIOUS_APPLICATION_COUNT | PREV_APPL_MEAN_AMT_ANNUITY | PREV_APPL_MEAN_AMT_CREDIT | PREV_APPL_MEAN_AMT_DOWN_PAYMENT | PREV_APPL_MEAN_HOUR_APPR_PROCESS_START | PREV_APPL_MEAN_NFLAG_LAST_APPL_IN_DAY | PREV_APPL_MEAN_RATE_DOWN_PAYMENT | PREV_APPL_MEAN_DAYS_DECISION | PREV_APPL_MEAN_SELLERPLACE_AREA | PREV_APPL_MEAN_CNT_PAYMENT | PREV_APPL_MEAN_NFLAG_INSURED_ON_APPROVAL | PREV_APPL_MEAN_CARD_MEAN_MONTHS_BALANCE | PREV_APPL_MEAN_CARD_MEAN_AMT_CREDIT_LIMIT_ACTUAL | PREV_APPL_MEAN_CARD_MEAN_AMT_DRAWINGS_CURRENT | PREV_APPL_MEAN_CARD_MEAN_AMT_INST_MIN_REGULARITY | PREV_APPL_MEAN_CARD_MEAN_AMT_PAYMENT_TOTAL_CURRENT | PREV_APPL_MEAN_CARD_MEAN_AMT_TOTAL_RECEIVABLE | PREV_APPL_MEAN_CARD_MEAN_CNT_DRAWINGS_CURRENT | PREV_APPL_MEAN_CARD_MEAN_CNT_INSTALMENT_MATURE_CUM | PREV_APPL_MEAN_CARD_MEAN_SK_DPD | PREV_APPL_MEAN_CARD_MEAN_SK_DPD_DEF | PREV_APPL_MEAN_INSTALL_MEAN_NUM_INSTALMENT_VERSION | PREV_APPL_MEAN_INSTALL_MEAN_NUM_INSTALMENT_NUMBER | PREV_APPL_MEAN_INSTALL_MEAN_DAYS_INSTALMENT | PREV_APPL_MEAN_INSTALL_MEAN_DAYS_ENTRY_PAYMENT | PREV_APPL_MEAN_INSTALL_MEAN_AMT_INSTALMENT | PREV_APPL_MEAN_INSTALL_MEAN_AMT_PAYMENT | PREV_APPL_MEAN_POS_MEAN_MONTHS_BALANCE | PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT | PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT_FUTURE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 307511 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | Unaccompanied | Working | Higher education | Married | House / apartment | 0.01885 | -19241 | -5170.0 | -812 | NaN | 0 | 0 | 1 | NaN | 2.0 | 2 | TUESDAY | 18 | 0 | 0 | 0 | 0 | 0 | Kindergarten | 0.752614 | 0.789654 | 0.15952 | 0.066 | 0.059 | 0.9732 | NaN | NaN | NaN | 0.1379 | 0.125 | NaN | NaN | NaN | 0.0505 | NaN | NaN | NaN | block of flats | Stone, brick | No | 0.0 | 0.0 | 0.0 | -1740.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | -735.0 | 0.0 | 82.428571 | -825.5 | NaN | 0.0 | 207623.571429 | 85240.928571 | 0.0 | 0.0 | -93.142857 | 3545.357143 | -11.785714 | 1.0 | 3951.0 | 23787.0 | 2520.0 | 13.0 | 1.0 | 0.104326 | -1740.0 | 23.0 | 8.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.25 | 2.5 | -1664.0 | -1679.5 | 7312.725 | 7312.725 | -55.0 | 4.0 | 2.0 |
test[test["SK_ID_CURR"] == 100005]
| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_8 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | PREVIOUS_LOANS_COUNT | PREV_BUR_MEAN_DAYS_CREDIT | PREV_BUR_MEAN_CREDIT_DAY_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE | PREV_BUR_MEAN_DAYS_ENDDATE_FACT | PREV_BUR_MEAN_AMT_CREDIT_MAX_OVERDUE | PREV_BUR_MEAN_CNT_CREDIT_PROLONG | PREV_BUR_MEAN_AMT_CREDIT_SUM | PREV_BUR_MEAN_AMT_CREDIT_SUM_DEBT | PREV_BUR_MEAN_AMT_CREDIT_SUM_LIMIT | PREV_BUR_MEAN_AMT_CREDIT_SUM_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_UPDATE | PREV_BUR_MEAN_AMT_ANNUITY | PREV_BUR_MEAN_BUR_BAL_MEAN_MONTHS_BALANCE | PREVIOUS_APPLICATION_COUNT | PREV_APPL_MEAN_AMT_ANNUITY | PREV_APPL_MEAN_AMT_CREDIT | PREV_APPL_MEAN_AMT_DOWN_PAYMENT | PREV_APPL_MEAN_HOUR_APPR_PROCESS_START | PREV_APPL_MEAN_NFLAG_LAST_APPL_IN_DAY | PREV_APPL_MEAN_RATE_DOWN_PAYMENT | PREV_APPL_MEAN_DAYS_DECISION | PREV_APPL_MEAN_SELLERPLACE_AREA | PREV_APPL_MEAN_CNT_PAYMENT | PREV_APPL_MEAN_NFLAG_INSURED_ON_APPROVAL | PREV_APPL_MEAN_CARD_MEAN_MONTHS_BALANCE | PREV_APPL_MEAN_CARD_MEAN_AMT_CREDIT_LIMIT_ACTUAL | PREV_APPL_MEAN_CARD_MEAN_AMT_DRAWINGS_CURRENT | PREV_APPL_MEAN_CARD_MEAN_AMT_INST_MIN_REGULARITY | PREV_APPL_MEAN_CARD_MEAN_AMT_PAYMENT_TOTAL_CURRENT | PREV_APPL_MEAN_CARD_MEAN_AMT_TOTAL_RECEIVABLE | PREV_APPL_MEAN_CARD_MEAN_CNT_DRAWINGS_CURRENT | PREV_APPL_MEAN_CARD_MEAN_CNT_INSTALMENT_MATURE_CUM | PREV_APPL_MEAN_CARD_MEAN_SK_DPD | PREV_APPL_MEAN_CARD_MEAN_SK_DPD_DEF | PREV_APPL_MEAN_INSTALL_MEAN_NUM_INSTALMENT_VERSION | PREV_APPL_MEAN_INSTALL_MEAN_NUM_INSTALMENT_NUMBER | PREV_APPL_MEAN_INSTALL_MEAN_DAYS_INSTALMENT | PREV_APPL_MEAN_INSTALL_MEAN_DAYS_ENTRY_PAYMENT | PREV_APPL_MEAN_INSTALL_MEAN_AMT_INSTALMENT | PREV_APPL_MEAN_INSTALL_MEAN_AMT_PAYMENT | PREV_APPL_MEAN_POS_MEAN_MONTHS_BALANCE | PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT | PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT_FUTURE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 307512 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.035792 | -18064 | -9118.0 | -1623 | NaN | 0 | 0 | 0 | Low-skill Laborers | 2.0 | 2 | FRIDAY | 9 | 0 | 0 | 0 | 0 | 0 | Self-employed | 0.56499 | 0.291656 | 0.432962 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | -190.666667 | 0.0 | 439.333333 | -123.0 | 0.0 | 0.0 | 219042.0 | 189469.5 | 0.0 | 0.0 | -54.333333 | 1420.5 | -3.0 | 2.0 | 4813.2 | 20076.75 | 4464.0 | 10.5 | 1.0 | 0.108964 | -536.0 | 18.0 | 12.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.111111 | 5.0 | -586.0 | -609.555556 | 6240.205 | 6240.205 | -20.0 | 11.7 | 7.2 |
train[train["SK_ID_CURR"] == 100001]
Empty DataFrame: 0 rows × 112 columns (SK_ID_CURR 100001 is found in test, so no train row matches).
train[train["SK_ID_CURR"] == 100005]
Empty DataFrame: 0 rows × 112 columns (SK_ID_CURR 100005 is found in test, so no train row matches).
train.shape
(307511, 112)
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 307511 entries, 0 to 307510
Columns: 112 entries, SK_ID_CURR to PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT_FUTURE
dtypes: float64(79), int64(17), object(16)
memory usage: 265.1+ MB
missing_values_table(train)
Your selected dataframe has 112 columns. There are 80 columns that have missing values.
| | Missing Values | % of Total Values |
|---|---|---|
| PREV_APPL_MEAN_CARD_MEAN_CNT_INSTALMENT_MATURE_CUM | 229577 | 74.7 |
| PREV_APPL_MEAN_CARD_MEAN_SK_DPD | 229577 | 74.7 |
| PREV_APPL_MEAN_CARD_MEAN_AMT_CREDIT_LIMIT_ACTUAL | 229577 | 74.7 |
| PREV_APPL_MEAN_CARD_MEAN_AMT_DRAWINGS_CURRENT | 229577 | 74.7 |
| PREV_APPL_MEAN_CARD_MEAN_AMT_INST_MIN_REGULARITY | 229577 | 74.7 |
| ... | ... | ... |
| DEF_60_CNT_SOCIAL_CIRCLE | 1021 | 0.3 |
| EXT_SOURCE_2 | 660 | 0.2 |
| AMT_ANNUITY | 12 | 0.0 |
| CNT_FAM_MEMBERS | 2 | 0.0 |
| DAYS_LAST_PHONE_CHANGE | 1 | 0.0 |
80 rows × 2 columns
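`missing_values_table` is defined outside this section. For readers who want to reproduce this kind of summary, here is a minimal sketch of a helper with the same output columns. The name `missing_summary` and the toy DataFrame are hypothetical and are not the notebook's own code.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column (hypothetical helper)."""
    counts = df.isnull().sum()
    percents = 100 * counts / len(df)
    table = pd.DataFrame({"Missing Values": counts, "% of Total Values": percents})
    # Keep only the columns that actually have gaps, largest share first
    table = table[table["Missing Values"] > 0]
    return table.sort_values("% of Total Values", ascending=False).round(1)


# Toy data for illustration only
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": ["x", None, "y", "z"],
    "c": [1, 2, 3, 4],
})
summary = missing_summary(toy)
print(summary)
```

Columns without any missing value (`c` here) are dropped from the summary, which is why the table above lists 80 of the 112 columns.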
train.select_dtypes(include=['category', "object"])
| | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | OCCUPATION_TYPE | WEEKDAY_APPR_PROCESS_START | ORGANIZATION_TYPE | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cash loans | M | N | Y | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | Laborers | WEDNESDAY | Business Entity Type 3 | reg oper account | block of flats | Stone, brick | No |
| 1 | Cash loans | F | N | N | Family | State servant | Higher education | Married | House / apartment | Core staff | MONDAY | School | reg oper account | block of flats | Block | No |
| 2 | Revolving loans | M | Y | Y | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | Laborers | MONDAY | Government | NaN | NaN | NaN | NaN |
| 3 | Cash loans | F | N | Y | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | Laborers | WEDNESDAY | Business Entity Type 3 | NaN | NaN | NaN | NaN |
| 4 | Cash loans | M | N | Y | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | Core staff | THURSDAY | Religion | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307506 | Cash loans | M | N | N | Unaccompanied | Working | Secondary / secondary special | Separated | With parents | Sales staff | THURSDAY | Services | reg oper account | block of flats | Stone, brick | No |
| 307507 | Cash loans | F | N | Y | Unaccompanied | Pensioner | Secondary / secondary special | Widow | House / apartment | NaN | MONDAY | XNA | reg oper account | block of flats | Stone, brick | No |
| 307508 | Cash loans | F | N | Y | Unaccompanied | Working | Higher education | Separated | House / apartment | Managers | THURSDAY | School | reg oper account | block of flats | Panel | No |
| 307509 | Cash loans | F | N | Y | Unaccompanied | Commercial associate | Secondary / secondary special | Married | House / apartment | Laborers | WEDNESDAY | Business Entity Type 1 | NaN | block of flats | Stone, brick | No |
| 307510 | Cash loans | F | N | N | Unaccompanied | Commercial associate | Higher education | Married | House / apartment | Laborers | THURSDAY | Business Entity Type 3 | NaN | block of flats | Panel | No |
307511 rows × 16 columns
Everything looks good now!
# Total number of missing cells across all columns, before any filling
print('Missing values in initial train data:', train.isnull().sum().sum())
print('Missing values in initial test data:', test.isnull().sum().sum())
Missing values in initial train data: 7769136
Missing values in initial test data: 1092343
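Since numeric and non-numeric columns are filled differently in the next steps, it can help to see how the total splits between the two groups. The sketch below uses a toy DataFrame (hypothetical values) and `select_dtypes(exclude="number")` for the non-numeric part, which is an assumption made here for portability across pandas versions.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "income": [100.0, np.nan, 300.0],
    "children": [0, 1, 2],
    "occupation": ["Laborers", None, None],
})

# Missing cells in numeric columns (to be filled with column means)
numeric_missing = toy.select_dtypes(include="number").isnull().sum().sum()
# Missing cells in all other columns (to be filled with a placeholder)
other_missing = toy.select_dtypes(exclude="number").isnull().sum().sum()
total_missing = toy.isnull().sum().sum()

print("numeric:", numeric_missing, "other:", other_missing, "total:", total_missing)
```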
Fill missing values in the train dataset
# Fill missing values in the numeric ('int64', 'float64') columns of train with each column's mean
train_type_int_float = train.select_dtypes(include=['int64', 'float64'])
train.fillna(train_type_int_float.mean(), inplace=True)
# Check whether there is a 'category' column or not
train.select_dtypes(include=['category'])
Empty DataFrame: 307511 rows × 0 columns
So, no column in the "train" dataset has the 'category' type.
# Fill missing values in the 'object' columns of train with the placeholder 'Unknown'
list_obj = train.select_dtypes(include=['object']).columns.to_list()
temp = train[list_obj].fillna('Unknown')
train[list_obj] = temp
train.head()
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_8 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | PREVIOUS_LOANS_COUNT | PREV_BUR_MEAN_DAYS_CREDIT | PREV_BUR_MEAN_CREDIT_DAY_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE | PREV_BUR_MEAN_DAYS_ENDDATE_FACT | PREV_BUR_MEAN_AMT_CREDIT_MAX_OVERDUE | PREV_BUR_MEAN_CNT_CREDIT_PROLONG | PREV_BUR_MEAN_AMT_CREDIT_SUM | PREV_BUR_MEAN_AMT_CREDIT_SUM_DEBT | PREV_BUR_MEAN_AMT_CREDIT_SUM_LIMIT | PREV_BUR_MEAN_AMT_CREDIT_SUM_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_UPDATE | PREV_BUR_MEAN_AMT_ANNUITY | PREV_BUR_MEAN_BUR_BAL_MEAN_MONTHS_BALANCE | PREVIOUS_APPLICATION_COUNT | PREV_APPL_MEAN_AMT_ANNUITY | PREV_APPL_MEAN_AMT_CREDIT | PREV_APPL_MEAN_AMT_DOWN_PAYMENT | PREV_APPL_MEAN_HOUR_APPR_PROCESS_START | PREV_APPL_MEAN_NFLAG_LAST_APPL_IN_DAY | PREV_APPL_MEAN_RATE_DOWN_PAYMENT | PREV_APPL_MEAN_DAYS_DECISION | PREV_APPL_MEAN_SELLERPLACE_AREA | PREV_APPL_MEAN_CNT_PAYMENT | PREV_APPL_MEAN_NFLAG_INSURED_ON_APPROVAL | PREV_APPL_MEAN_CARD_MEAN_MONTHS_BALANCE | PREV_APPL_MEAN_CARD_MEAN_AMT_CREDIT_LIMIT_ACTUAL | PREV_APPL_MEAN_CARD_MEAN_AMT_DRAWINGS_CURRENT | PREV_APPL_MEAN_CARD_MEAN_AMT_INST_MIN_REGULARITY | PREV_APPL_MEAN_CARD_MEAN_AMT_PAYMENT_TOTAL_CURRENT | PREV_APPL_MEAN_CARD_MEAN_AMT_TOTAL_RECEIVABLE | PREV_APPL_MEAN_CARD_MEAN_CNT_DRAWINGS_CURRENT | PREV_APPL_MEAN_CARD_MEAN_CNT_INSTALMENT_MATURE_CUM | PREV_APPL_MEAN_CARD_MEAN_SK_DPD | PREV_APPL_MEAN_CARD_MEAN_SK_DPD_DEF | PREV_APPL_MEAN_INSTALL_MEAN_NUM_INSTALMENT_VERSION | PREV_APPL_MEAN_INSTALL_MEAN_NUM_INSTALMENT_NUMBER | PREV_APPL_MEAN_INSTALL_MEAN_DAYS_INSTALMENT | PREV_APPL_MEAN_INSTALL_MEAN_DAYS_ENTRY_PAYMENT | PREV_APPL_MEAN_INSTALL_MEAN_AMT_INSTALMENT | PREV_APPL_MEAN_INSTALL_MEAN_AMT_PAYMENT | PREV_APPL_MEAN_POS_MEAN_MONTHS_BALANCE | PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT | PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT_FUTURE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1.0 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -3648.0 | -2120 | 12.061091 | 0 | 1 | 0 | Laborers | 1.0 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.02470 | 0.036900 | 0.972200 | 0.619200 | 0.014300 | 0.000000 | 0.069000 | 0.083300 | 0.125000 | 0.036900 | 0.020200 | 0.019000 | 0.000000 | 0.000000 | reg oper account | block of flats | Stone, brick | No | 2.0 | 2.0 | 2.0 | -1134.0 | 1 | 0 | 0 | 0.000000 | 0.000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 8.000000 | -874.00000 | 0.000000 | -349.000000 | -697.500000 | 1681.029000 | 0.000000 | 108131.945625 | 49156.200000 | 7997.141250 | 0.000000 | -499.875000 | 0.00000 | -21.875000 | 1.0 | 9251.775 | 179055.00 | 0.00 | 9.000000 | 1.0 | 0.000000 | -606.000000 | 500.000000 | 24.000000 | 0.000000 | -16.05068 | 222543.309831 | 15300.184003 | 3738.608219 | 11164.747865 | 75101.9839 | 1.704668 | 7.998155 | 3.142326 | 0.027201 | 1.052632 | 10.000000 | -295.000000 | -315.421053 | 11559.247105 | 11559.247105 | -10.000000 | 24.000000 | 15.000000 |
| 1 | 100003 | 0.0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1186.0 | -291 | 12.061091 | 0 | 1 | 0 | Core staff | 2.0 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | School | 0.311267 | 0.622246 | 0.510853 | 0.09590 | 0.052900 | 0.985100 | 0.796000 | 0.060500 | 0.080000 | 0.034500 | 0.291700 | 0.333300 | 0.013000 | 0.077300 | 0.054900 | 0.003900 | 0.009800 | reg oper account | block of flats | Block | No | 0.0 | 1.0 | 0.0 | -828.0 | 1 | 0 | 0 | 0.000000 | 0.000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.000000 | -1400.75000 | 0.000000 | -544.500000 | -1097.333333 | 0.000000 | 0.000000 | 254350.125000 | 0.000000 | 202500.000000 | 0.000000 | -816.000000 | 16052.24733 | -20.984805 | 3.0 | 56553.990 | 484191.00 | 3442.50 | 14.666667 | 1.0 | 0.050030 | -1305.000000 | 533.000000 | 10.000000 | 0.666667 | -16.05068 | 222543.309831 | 15300.184003 | 3738.608219 | 11164.747865 | 75101.9839 | 1.704668 | 7.998155 | 3.142326 | 0.027201 | 1.047619 | 4.666667 | -1164.333333 | -1171.781746 | 78558.479286 | 78558.479286 | -39.166667 | 9.791667 | 5.666667 |
| 2 | 100004 | 0.0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -4260.0 | -2531 | 26.000000 | 1 | 1 | 0 | Laborers | 1.0 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | Government | 0.502130 | 0.555912 | 0.729567 | 0.11744 | 0.088442 | 0.977735 | 0.752471 | 0.044621 | 0.078942 | 0.149725 | 0.226282 | 0.231894 | 0.066333 | 0.100775 | 0.107399 | 0.008809 | 0.028358 | Unknown | Unknown | Unknown | Unknown | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0.000000 | 0.000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | -867.00000 | 0.000000 | -488.500000 | -532.500000 | 0.000000 | 0.000000 | 94518.900000 | 0.000000 | 0.000000 | 0.000000 | -532.000000 | 16052.24733 | -20.984805 | 1.0 | 5357.250 | 20106.00 | 4860.00 | 5.000000 | 1.0 | 0.212008 | -815.000000 | 30.000000 | 4.000000 | 0.000000 | -16.05068 | 222543.309831 | 15300.184003 | 3738.608219 | 11164.747865 | 75101.9839 | 1.704668 | 7.998155 | 3.142326 | 0.027201 | 1.333333 | 2.000000 | -754.000000 | -761.666667 | 7096.155000 | 7096.155000 | -25.500000 | 3.750000 | 2.250000 |
| 3 | 100006 | 0.0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -9833.0 | -2437 | 12.061091 | 0 | 0 | 0 | Laborers | 2.0 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.502130 | 0.650442 | 0.510853 | 0.11744 | 0.088442 | 0.977735 | 0.752471 | 0.044621 | 0.078942 | 0.149725 | 0.226282 | 0.231894 | 0.066333 | 0.100775 | 0.107399 | 0.008809 | 0.028358 | Unknown | Unknown | Unknown | Unknown | 0.0 | 2.0 | 0.0 | -617.0 | 1 | 0 | 0 | 0.006402 | 0.007 | 0.034362 | 0.267395 | 0.265474 | 1.899974 | 5.561196 | -1083.04711 | 1.035863 | 651.807511 | -970.304531 | 5242.425046 | 0.007919 | 378080.200789 | 160390.076973 | 5901.475578 | 49.549302 | -546.632499 | 16052.24733 | -20.984805 | 9.0 | 23651.175 | 291695.50 | 34840.17 | 14.666667 | 1.0 | 0.163412 | -272.444444 | 894.222222 | 23.000000 | 0.000000 | -3.50000 | 270000.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.400000 | 3.166667 | -260.666667 | -285.966667 | 241944.225000 | 241944.225000 | -9.000000 | 12.888889 | 10.214286 |
| 4 | 100007 | 0.0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -4311.0 | -3458 | 12.061091 | 0 | 0 | 0 | Core staff | 1.0 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 1 | 1 | Religion | 0.502130 | 0.322738 | 0.510853 | 0.11744 | 0.088442 | 0.977735 | 0.752471 | 0.044621 | 0.078942 | 0.149725 | 0.226282 | 0.231894 | 0.066333 | 0.100775 | 0.107399 | 0.008809 | 0.028358 | Unknown | Unknown | Unknown | Unknown | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 1 | 0.000000 | 0.000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | -1149.00000 | 0.000000 | -783.000000 | -783.000000 | 0.000000 | 0.000000 | 146250.000000 | 0.000000 | 0.000000 | 0.000000 | -783.000000 | 16052.24733 | -20.984805 | 6.0 | 12278.805 | 166638.75 | 3390.75 | 12.333333 | 1.0 | 0.159516 | -1222.833333 | 409.166667 | 20.666667 | 0.600000 | -16.05068 | 222543.309831 | 15300.184003 | 3738.608219 | 11164.747865 | 75101.9839 | 1.704668 | 7.998155 | 3.142326 | 0.027201 | 1.129412 | 6.843956 | -1087.881319 | -1090.768539 | 12122.995738 | 11671.540210 | -36.100000 | 15.066667 | 8.966667 |
train.shape
(307511, 112)
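The two fill steps above (column means for numeric columns, 'Unknown' for the rest) can be wrapped into a pair of small functions so that the same fill values are reused on another dataset. This is only a sketch under stated assumptions: `fit_fill_values`, `apply_fill`, and the toy frames are hypothetical names and data, and `select_dtypes(exclude="number")` stands in for the notebook's `include=['object']`.

```python
import numpy as np
import pandas as pd


def fit_fill_values(train_df: pd.DataFrame) -> pd.Series:
    """Per-column means of the numeric columns (hypothetical helper)."""
    return train_df.select_dtypes(include="number").mean()


def apply_fill(df: pd.DataFrame, numeric_means: pd.Series,
               placeholder: str = "Unknown") -> pd.DataFrame:
    """Return a copy with numeric gaps filled by the given means, other gaps by a placeholder."""
    out = df.fillna(numeric_means)  # labels missing from df (e.g. TARGET) are ignored
    other_cols = out.select_dtypes(exclude="number").columns
    out[other_cols] = out[other_cols].fillna(placeholder)
    return out


# Toy data for illustration only
toy_train = pd.DataFrame({
    "TARGET": [0, 1, 0],
    "age": [20.0, np.nan, 40.0],
    "job": ["a", None, "b"],
})
toy_test = pd.DataFrame({"age": [np.nan, 50.0], "job": [None, "c"]})

means = fit_fill_values(toy_train)
train_filled = apply_fill(toy_train, means)
test_filled = apply_fill(toy_test, means)
print(test_filled)
```

Note that filling a column with its own mean leaves that mean unchanged, so computing the means before or after filling train gives the same fill values.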
Fill missing values in the test dataset
# Fill missing values in the numeric ('int64', 'float64') columns of test with the column means
# computed on train, so that both datasets are filled with the same values
train_numeric_means = train.select_dtypes(include=['int64', 'float64']).mean()
test.fillna(train_numeric_means, inplace=True)
# Check whether there is a 'category' column or not
test.select_dtypes(include=['category'])
Empty DataFrame: 48744 rows × 0 columns
So, no column in the "test" dataset has the 'category' type.
# Fill missing values in the 'object' columns of test with the placeholder 'Unknown'
list_obj = test.select_dtypes(include=['object']).columns.to_list()
temp = test[list_obj].fillna('Unknown')
test[list_obj] = temp
test.head()
| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_8 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | PREVIOUS_LOANS_COUNT | PREV_BUR_MEAN_DAYS_CREDIT | PREV_BUR_MEAN_CREDIT_DAY_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE | PREV_BUR_MEAN_DAYS_ENDDATE_FACT | PREV_BUR_MEAN_AMT_CREDIT_MAX_OVERDUE | PREV_BUR_MEAN_CNT_CREDIT_PROLONG | PREV_BUR_MEAN_AMT_CREDIT_SUM | PREV_BUR_MEAN_AMT_CREDIT_SUM_DEBT | PREV_BUR_MEAN_AMT_CREDIT_SUM_LIMIT | PREV_BUR_MEAN_AMT_CREDIT_SUM_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_UPDATE | PREV_BUR_MEAN_AMT_ANNUITY | PREV_BUR_MEAN_BUR_BAL_MEAN_MONTHS_BALANCE | PREVIOUS_APPLICATION_COUNT | PREV_APPL_MEAN_AMT_ANNUITY | PREV_APPL_MEAN_AMT_CREDIT | PREV_APPL_MEAN_AMT_DOWN_PAYMENT | PREV_APPL_MEAN_HOUR_APPR_PROCESS_START | PREV_APPL_MEAN_NFLAG_LAST_APPL_IN_DAY | PREV_APPL_MEAN_RATE_DOWN_PAYMENT | PREV_APPL_MEAN_DAYS_DECISION | PREV_APPL_MEAN_SELLERPLACE_AREA | PREV_APPL_MEAN_CNT_PAYMENT | PREV_APPL_MEAN_NFLAG_INSURED_ON_APPROVAL | PREV_APPL_MEAN_CARD_MEAN_MONTHS_BALANCE | PREV_APPL_MEAN_CARD_MEAN_AMT_CREDIT_LIMIT_ACTUAL | PREV_APPL_MEAN_CARD_MEAN_AMT_DRAWINGS_CURRENT | PREV_APPL_MEAN_CARD_MEAN_AMT_INST_MIN_REGULARITY | PREV_APPL_MEAN_CARD_MEAN_AMT_PAYMENT_TOTAL_CURRENT | PREV_APPL_MEAN_CARD_MEAN_AMT_TOTAL_RECEIVABLE | PREV_APPL_MEAN_CARD_MEAN_CNT_DRAWINGS_CURRENT | PREV_APPL_MEAN_CARD_MEAN_CNT_INSTALMENT_MATURE_CUM | PREV_APPL_MEAN_CARD_MEAN_SK_DPD | PREV_APPL_MEAN_CARD_MEAN_SK_DPD_DEF | PREV_APPL_MEAN_INSTALL_MEAN_NUM_INSTALMENT_VERSION | PREV_APPL_MEAN_INSTALL_MEAN_NUM_INSTALMENT_NUMBER | PREV_APPL_MEAN_INSTALL_MEAN_DAYS_INSTALMENT | PREV_APPL_MEAN_INSTALL_MEAN_DAYS_ENTRY_PAYMENT | PREV_APPL_MEAN_INSTALL_MEAN_AMT_INSTALMENT | PREV_APPL_MEAN_INSTALL_MEAN_AMT_PAYMENT | PREV_APPL_MEAN_POS_MEAN_MONTHS_BALANCE | PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT | PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT_FUTURE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 307511 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | Unaccompanied | Working | Higher education | Married | House / apartment | 0.018850 | -19241 | -5170.0 | -812 | 12.061091 | 0 | 0 | 1 | Unknown | 2.0 | 2 | TUESDAY | 18 | 0 | 0 | 0 | 0 | 0 | Kindergarten | 0.752614 | 0.789654 | 0.159520 | 0.06600 | 0.059000 | 0.973200 | 0.752471 | 0.044621 | 0.078942 | 0.137900 | 0.125000 | 0.231894 | 0.066333 | 0.100775 | 0.050500 | 0.008809 | 0.028358 | Unknown | block of flats | Stone, brick | No | 0.0 | 0.0 | 0.0 | -1740.0 | 1 | 0 | 0 | 0.000000 | 0.000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7.000000 | -735.000000 | 0.000000 | 82.428571 | -825.500000 | 5242.425046 | 0.000000 | 207623.571429 | 85240.928571 | 0.000000 | 0.000000 | -93.142857 | 3545.357143 | -11.785714 | 1.0 | 3951.000 | 23787.000 | 2520.0 | 13.0 | 1.0 | 0.104326 | -1740.0 | 23.0 | 8.000000 | 0.000000 | -16.05068 | 222543.30983 | 15300.184003 | 3738.608219 | 11164.747865 | 75101.983900 | 1.704668 | 7.998155 | 3.142326 | 0.027201 | 1.250000 | 2.500000 | -1664.000000 | -1679.500000 | 7312.725000 | 7312.725000 | -55.000000 | 4.000000 | 2.000000 |
| 307512 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.035792 | -18064 | -9118.0 | -1623 | 12.061091 | 0 | 0 | 0 | Low-skill Laborers | 2.0 | 2 | FRIDAY | 9 | 0 | 0 | 0 | 0 | 0 | Self-employed | 0.564990 | 0.291656 | 0.432962 | 0.11744 | 0.088442 | 0.977735 | 0.752471 | 0.044621 | 0.078942 | 0.149725 | 0.226282 | 0.231894 | 0.066333 | 0.100775 | 0.107399 | 0.008809 | 0.028358 | Unknown | Unknown | Unknown | Unknown | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0.000000 | 0.000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 | 3.000000 | -190.666667 | 0.000000 | 439.333333 | -123.000000 | 0.000000 | 0.000000 | 219042.000000 | 189469.500000 | 0.000000 | 0.000000 | -54.333333 | 1420.500000 | -3.000000 | 2.0 | 4813.200 | 20076.750 | 4464.0 | 10.5 | 1.0 | 0.108964 | -536.0 | 18.0 | 12.000000 | 0.000000 | -16.05068 | 222543.30983 | 15300.184003 | 3738.608219 | 11164.747865 | 75101.983900 | 1.704668 | 7.998155 | 3.142326 | 0.027201 | 1.111111 | 5.000000 | -586.000000 | -609.555556 | 6240.205000 | 6240.205000 | -20.000000 | 11.700000 | 7.200000 |
| 307513 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | Unknown | Working | Higher education | Married | House / apartment | 0.019101 | -20038 | -2175.0 | -3503 | 5.000000 | 0 | 0 | 0 | Drivers | 2.0 | 2 | MONDAY | 14 | 0 | 0 | 0 | 0 | 0 | Transport: type 3 | 0.502130 | 0.699787 | 0.610991 | 0.11744 | 0.088442 | 0.977735 | 0.752471 | 0.044621 | 0.078942 | 0.149725 | 0.226282 | 0.231894 | 0.066333 | 0.100775 | 0.107399 | 0.008809 | 0.028358 | Unknown | Unknown | Unknown | Unknown | 0.0 | 0.0 | 0.0 | -856.0 | 0 | 0 | 1 | 0.000000 | 0.000 | 0.000000 | 0.000000 | 1.000000 | 4.000000 | 4.000000 | -1737.500000 | 0.000000 | -1068.000000 | -1054.750000 | 19305.000000 | 0.000000 | 518070.015000 | 0.000000 | 5901.475578 | 0.000000 | -775.500000 | 0.000000 | -28.250000 | 4.0 | 11478.195 | 146134.125 | 3375.0 | 14.5 | 1.0 | 0.067217 | -837.5 | 82.0 | 17.333333 | 0.333333 | -16.05068 | 222543.30983 | 15300.184003 | 3738.608219 | 11164.747865 | 75101.983900 | 1.704668 | 7.998155 | 3.142326 | 0.027201 | 1.050926 | 6.027778 | -854.833333 | -867.592593 | 16349.077917 | 13702.794792 | -28.833333 | 16.648148 | 11.451178 |
| 307514 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.026392 | -13976 | -2000.0 | -4208 | 12.061091 | 0 | 1 | 0 | Sales staff | 4.0 | 2 | WEDNESDAY | 11 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.525734 | 0.509677 | 0.612704 | 0.30520 | 0.197400 | 0.997000 | 0.959200 | 0.116500 | 0.320000 | 0.275900 | 0.375000 | 0.041700 | 0.204200 | 0.240400 | 0.367300 | 0.038600 | 0.080000 | reg oper account | block of flats | Panel | No | 0.0 | 0.0 | 0.0 | -1805.0 | 1 | 0 | 0 | 0.000000 | 0.000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 | 12.000000 | -1401.750000 | 0.000000 | 2387.700000 | -1238.285714 | 0.000000 | 0.000000 | 126739.590000 | 18630.450000 | 14484.394286 | 0.000000 | -651.500000 | 3012.010714 | -22.833333 | 5.0 | 8091.585 | 92920.500 | 3750.0 | 10.8 | 1.0 | 0.057708 | -1124.2 | 1409.6 | 11.333333 | 0.000000 | -25.00000 | 225000.00000 | 6156.400408 | 6133.363929 | 5606.152347 | 7968.609184 | 2.387755 | 19.547619 | 0.000000 | 0.000000 | 1.038889 | 17.595238 | -944.964286 | -949.814286 | 7836.897982 | 7557.738339 | -35.250000 | 15.928571 | 8.312500 |
| 307515 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.010032 | -13040 | -4000.0 | -4262 | 16.000000 | 1 | 0 | 0 | Unknown | 3.0 | 2 | FRIDAY | 5 | 0 | 0 | 0 | 1 | 1 | Business Entity Type 3 | 0.202145 | 0.425687 | 0.510853 | 0.11744 | 0.088442 | 0.977735 | 0.752471 | 0.044621 | 0.078942 | 0.149725 | 0.226282 | 0.231894 | 0.066333 | 0.100775 | 0.107399 | 0.008809 | 0.028358 | Unknown | Unknown | Unknown | Unknown | 0.0 | 0.0 | 0.0 | -821.0 | 1 | 0 | 0 | 0.006402 | 0.007 | 0.034362 | 0.267395 | 0.265474 | 1.899974 | 5.561196 | -1083.047110 | 1.035863 | 651.807511 | -970.304531 | 5242.425046 | 0.007919 | 378080.200789 | 160390.076973 | 5901.475578 | 49.549302 | -546.632499 | 16052.247330 | -20.984805 | 2.0 | 17782.155 | 300550.500 | 8095.5 | 5.5 | 1.0 | 0.087554 | -466.0 | 13.0 | 24.000000 | 0.000000 | -16.05068 | 222543.30983 | 15300.184003 | 3738.608219 | 11164.747865 | 75101.983900 | 1.704668 | 7.998155 | 3.142326 | 0.027201 | 1.000000 | 6.500000 | -622.000000 | -634.250000 | 11100.337500 | 11100.337500 | -21.000000 | 12.000000 | 5.846154 |
print('Missing values in train data after filling: ', sum(train.isnull().sum()))
print('Missing values in test data after filling: ', sum(test.isnull().sum()))
Missing values in train data after filling: 0
Missing values in test data after filling: 0
So there are no more missing values in either the "train" or the "test" dataset.
numeric_col_train = train.select_dtypes(include=['int64', "float64"]).columns.to_list()
# Use boxplots to inspect outliers in every numeric feature
for feature in numeric_col_train:
    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    ax.boxplot(train[feature].dropna(), patch_artist=True, vert=False)
    ax.set_title("Boxplot of the feature: " + feature)
The boxplots give one main insight:
Many columns consist mostly of repeated values (for example flag and count columns), so we exclude them from the outlier removal.
# Remove these columns from variable 'numeric_col_train'
list_to_remove = ["SK_ID_CURR", 'TARGET', 'CNT_CHILDREN', 'FLAG_WORK_PHONE', 'REGION_RATING_CLIENT_W_CITY',
'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_8',
'AMT_REQ_CREDIT_BUREAU_HOUR']
final_list = list(set(numeric_col_train) - set(list_to_remove))
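# Mark outliers as NaN using the usual 1.5 * IQR fence,
# but with the 5th and 95th percentiles in place of Q1 and Q3
for x in final_list:
    q95, q5 = np.percentile(train.loc[:, x], [95, 5])
    spread = q95 - q5
    upper = q95 + (1.5 * spread)  # renamed from `max` to avoid shadowing the builtin
    lower = q5 - (1.5 * spread)   # renamed from `min` to avoid shadowing the builtin
    train.loc[train[x] < lower, x] = np.nan
    train.loc[train[x] > upper, x] = np.nan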
train.dropna(axis = 0, inplace = True)
# Size of train dataset after removing outliers
train.shape
(218646, 112)
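The percentile-fence rule applied above can be illustrated on a toy column. This is only a sketch under assumptions: the helper `percentile_fence_mask` and the synthetic data are hypothetical and not part of the notebook; the 5th/95th percentiles and the 1.5 multiplier mirror the cell above.

```python
import numpy as np
import pandas as pd


def percentile_fence_mask(values, low=5, high=95, k=1.5):
    """Return a boolean mask that is True for values inside the fence (hypothetical helper)."""
    p_low, p_high = np.percentile(values, [low, high])
    spread = p_high - p_low
    lower = p_low - k * spread
    upper = p_high + k * spread
    return (values >= lower) & (values <= upper)


# Toy column (hypothetical data): 99 ordinary values and one extreme value
toy = pd.DataFrame({"amount": list(range(1, 100)) + [10_000]})
mask = percentile_fence_mask(toy["amount"])
filtered = toy[mask]
print(len(toy), len(filtered))
```

Because the fence is built from the 5th and 95th percentiles rather than Q1 and Q3, it is much wider than the classic IQR rule, so in this toy column only the single extreme value falls outside it.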
# Use boxplots to inspect every numeric feature again after dropping outliers (5th/95th percentile fence)
for feature in numeric_col_train:
    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    ax.boxplot(train[feature].dropna(), patch_artist=True, vert=False)
    ax.set_title("Boxplot of the feature: " + feature)
Train dataset after preprocessing data
train.head()
|  | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_8 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | PREVIOUS_LOANS_COUNT | PREV_BUR_MEAN_DAYS_CREDIT | PREV_BUR_MEAN_CREDIT_DAY_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE | PREV_BUR_MEAN_DAYS_ENDDATE_FACT | PREV_BUR_MEAN_AMT_CREDIT_MAX_OVERDUE | PREV_BUR_MEAN_CNT_CREDIT_PROLONG | PREV_BUR_MEAN_AMT_CREDIT_SUM | PREV_BUR_MEAN_AMT_CREDIT_SUM_DEBT | PREV_BUR_MEAN_AMT_CREDIT_SUM_LIMIT | PREV_BUR_MEAN_AMT_CREDIT_SUM_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_UPDATE | PREV_BUR_MEAN_AMT_ANNUITY | PREV_BUR_MEAN_BUR_BAL_MEAN_MONTHS_BALANCE | PREVIOUS_APPLICATION_COUNT | PREV_APPL_MEAN_AMT_ANNUITY | PREV_APPL_MEAN_AMT_CREDIT | PREV_APPL_MEAN_AMT_DOWN_PAYMENT | PREV_APPL_MEAN_HOUR_APPR_PROCESS_START | PREV_APPL_MEAN_NFLAG_LAST_APPL_IN_DAY | PREV_APPL_MEAN_RATE_DOWN_PAYMENT | PREV_APPL_MEAN_DAYS_DECISION | PREV_APPL_MEAN_SELLERPLACE_AREA | PREV_APPL_MEAN_CNT_PAYMENT | PREV_APPL_MEAN_NFLAG_INSURED_ON_APPROVAL | PREV_APPL_MEAN_CARD_MEAN_MONTHS_BALANCE | PREV_APPL_MEAN_CARD_MEAN_AMT_CREDIT_LIMIT_ACTUAL | PREV_APPL_MEAN_CARD_MEAN_AMT_DRAWINGS_CURRENT | PREV_APPL_MEAN_CARD_MEAN_AMT_INST_MIN_REGULARITY | PREV_APPL_MEAN_CARD_MEAN_AMT_PAYMENT_TOTAL_CURRENT | PREV_APPL_MEAN_CARD_MEAN_AMT_TOTAL_RECEIVABLE | PREV_APPL_MEAN_CARD_MEAN_CNT_DRAWINGS_CURRENT | PREV_APPL_MEAN_CARD_MEAN_CNT_INSTALMENT_MATURE_CUM | PREV_APPL_MEAN_CARD_MEAN_SK_DPD | PREV_APPL_MEAN_CARD_MEAN_SK_DPD_DEF | PREV_APPL_MEAN_INSTALL_MEAN_NUM_INSTALMENT_VERSION | PREV_APPL_MEAN_INSTALL_MEAN_NUM_INSTALMENT_NUMBER | PREV_APPL_MEAN_INSTALL_MEAN_DAYS_INSTALMENT | PREV_APPL_MEAN_INSTALL_MEAN_DAYS_ENTRY_PAYMENT | PREV_APPL_MEAN_INSTALL_MEAN_AMT_INSTALMENT | PREV_APPL_MEAN_INSTALL_MEAN_AMT_PAYMENT | PREV_APPL_MEAN_POS_MEAN_MONTHS_BALANCE | PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT | PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT_FUTURE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1.0 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461.0 | -3648.0 | -2120.0 | 12.061091 | 0 | 1.0 | 0.0 | Laborers | 1.0 | 2 | WEDNESDAY | 10.0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.02470 | 0.036900 | 0.972200 | 0.619200 | 0.014300 | 0.000000 | 0.069000 | 0.083300 | 0.125000 | 0.036900 | 0.020200 | 0.019000 | 0.000000 | 0.000000 | reg oper account | block of flats | Stone, brick | No | 2.0 | 2.0 | 2.0 | -1134.0 | 1.0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 | -874.000000 | 0.0 | -349.000000 | -697.5 | 1681.029 | 0.0 | 108131.945625 | 49156.200000 | 7997.14125 | 0.0 | -499.875000 | 0.00000 | -21.875000 | 1.0 | 9251.775000 | 179055.000000 | 0.000000 | 9.000000 | 1.0 | 0.000000 | -606.000000 | 500.000000 | 24.000000 | 0.00 | -16.05068 | 222543.309831 | 15300.184003 | 3738.608219 | 11164.747865 | 75101.9839 | 1.704668 | 7.998155 | 3.142326 | 0.027201 | 1.052632 | 10.000000 | -295.000000 | -315.421053 | 11559.247105 | 11559.247105 | -10.000000 | 24.000000 | 15.000000 |
| 2 | 100004 | 0.0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046.0 | -4260.0 | -2531.0 | 26.000000 | 1 | 1.0 | 0.0 | Laborers | 1.0 | 2 | MONDAY | 9.0 | 0 | 0 | 0 | 0 | 0 | Government | 0.502130 | 0.555912 | 0.729567 | 0.11744 | 0.088442 | 0.977735 | 0.752471 | 0.044621 | 0.078942 | 0.149725 | 0.226282 | 0.231894 | 0.066333 | 0.100775 | 0.107399 | 0.008809 | 0.028358 | Unknown | Unknown | Unknown | Unknown | 0.0 | 0.0 | 0.0 | -815.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | -867.000000 | 0.0 | -488.500000 | -532.5 | 0.000 | 0.0 | 94518.900000 | 0.000000 | 0.00000 | 0.0 | -532.000000 | 16052.24733 | -20.984805 | 1.0 | 5357.250000 | 20106.000000 | 4860.000000 | 5.000000 | 1.0 | 0.212008 | -815.000000 | 30.000000 | 4.000000 | 0.00 | -16.05068 | 222543.309831 | 15300.184003 | 3738.608219 | 11164.747865 | 75101.9839 | 1.704668 | 7.998155 | 3.142326 | 0.027201 | 1.333333 | 2.000000 | -754.000000 | -761.666667 | 7096.155000 | 7096.155000 | -25.500000 | 3.750000 | 2.250000 |
| 4 | 100007 | 0.0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932.0 | -4311.0 | -3458.0 | 12.061091 | 0 | 0.0 | 0.0 | Core staff | 1.0 | 2 | THURSDAY | 11.0 | 0 | 0 | 0 | 1 | 1 | Religion | 0.502130 | 0.322738 | 0.510853 | 0.11744 | 0.088442 | 0.977735 | 0.752471 | 0.044621 | 0.078942 | 0.149725 | 0.226282 | 0.231894 | 0.066333 | 0.100775 | 0.107399 | 0.008809 | 0.028358 | Unknown | Unknown | Unknown | Unknown | 0.0 | 0.0 | 0.0 | -1106.0 | 0.0 | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1149.000000 | 0.0 | -783.000000 | -783.0 | 0.000 | 0.0 | 146250.000000 | 0.000000 | 0.00000 | 0.0 | -783.000000 | 16052.24733 | -20.984805 | 6.0 | 12278.805000 | 166638.750000 | 3390.750000 | 12.333333 | 1.0 | 0.159516 | -1222.833333 | 409.166667 | 20.666667 | 0.60 | -16.05068 | 222543.309831 | 15300.184003 | 3738.608219 | 11164.747865 | 75101.9839 | 1.704668 | 7.998155 | 3.142326 | 0.027201 | 1.129412 | 6.843956 | -1087.881319 | -1090.768539 | 12122.995738 | 11671.540210 | -36.100000 | 15.066667 | 8.966667 |
| 5 | 100008 | 0.0 | Cash loans | M | N | Y | 0 | 99000.0 | 490495.5 | 27517.5 | Spouse, partner | State servant | Secondary / secondary special | Married | House / apartment | 0.035792 | -16941.0 | -4970.0 | -477.0 | 12.061091 | 1 | 1.0 | 0.0 | Laborers | 2.0 | 2 | WEDNESDAY | 16.0 | 0 | 0 | 0 | 0 | 0 | Other | 0.502130 | 0.354225 | 0.621226 | 0.11744 | 0.088442 | 0.977735 | 0.752471 | 0.044621 | 0.078942 | 0.149725 | 0.226282 | 0.231894 | 0.066333 | 0.100775 | 0.107399 | 0.008809 | 0.028358 | Unknown | Unknown | Unknown | Unknown | 0.0 | 0.0 | 0.0 | -2536.0 | 1.0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 | -757.333333 | 0.0 | -391.333333 | -909.0 | 0.000 | 0.0 | 156148.500000 | 80019.000000 | 0.00000 | 0.0 | -611.000000 | 16052.24733 | -20.984805 | 5.0 | 15839.696250 | 162767.700000 | 5548.500000 | 12.000000 | 1.0 | 0.073051 | -1192.000000 | 73.000000 | 14.000000 | 0.25 | -16.05068 | 222543.309831 | 15300.184003 | 3738.608219 | 11164.747865 | 75101.9839 | 1.704668 | 7.998155 | 3.142326 | 0.027201 | 1.031250 | 4.852273 | -1318.931818 | -1301.861742 | 28547.512398 | 28275.099784 | -38.625000 | 13.388889 | 8.050821 |
| 6 | 100009 | 0.0 | Cash loans | F | Y | Y | 1 | 171000.0 | 1560726.0 | 41301.0 | Unaccompanied | Commercial associate | Higher education | Married | House / apartment | 0.035792 | -13778.0 | -1213.0 | -619.0 | 17.000000 | 0 | 1.0 | 0.0 | Accountants | 3.0 | 2 | SUNDAY | 16.0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.774761 | 0.724000 | 0.492060 | 0.11744 | 0.088442 | 0.977735 | 0.752471 | 0.044621 | 0.078942 | 0.149725 | 0.226282 | 0.231894 | 0.066333 | 0.100775 | 0.107399 | 0.008809 | 0.028358 | Unknown | Unknown | Unknown | Unknown | 0.0 | 1.0 | 0.0 | -1562.0 | 0.0 | 0 | 1 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 18.0 | -1271.500000 | 0.0 | -794.937500 | -1108.5 | 0.000 | 0.0 | 266711.750000 | 76953.535714 | 0.00000 | 0.0 | -851.611111 | 16052.24733 | -20.984805 | 7.0 | 10051.412143 | 70137.642857 | 9203.142857 | 13.714286 | 1.0 | 0.126602 | -719.285714 | 170.000000 | 8.000000 | 0.00 | -16.05068 | 222543.309831 | 15300.184003 | 3738.608219 | 11164.747865 | 75101.9839 | 1.704668 | 7.998155 | 3.142326 | 0.027201 | 1.000000 | 3.857143 | -602.571429 | -612.245238 | 10050.262714 | 10050.262714 | -19.928571 | 8.000000 | 4.598901 |
Test dataset after preprocessing data
test.head()
|  | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_8 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | PREVIOUS_LOANS_COUNT | PREV_BUR_MEAN_DAYS_CREDIT | PREV_BUR_MEAN_CREDIT_DAY_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE | PREV_BUR_MEAN_DAYS_ENDDATE_FACT | PREV_BUR_MEAN_AMT_CREDIT_MAX_OVERDUE | PREV_BUR_MEAN_CNT_CREDIT_PROLONG | PREV_BUR_MEAN_AMT_CREDIT_SUM | PREV_BUR_MEAN_AMT_CREDIT_SUM_DEBT | PREV_BUR_MEAN_AMT_CREDIT_SUM_LIMIT | PREV_BUR_MEAN_AMT_CREDIT_SUM_OVERDUE | PREV_BUR_MEAN_DAYS_CREDIT_UPDATE | PREV_BUR_MEAN_AMT_ANNUITY | PREV_BUR_MEAN_BUR_BAL_MEAN_MONTHS_BALANCE | PREVIOUS_APPLICATION_COUNT | PREV_APPL_MEAN_AMT_ANNUITY | PREV_APPL_MEAN_AMT_CREDIT | PREV_APPL_MEAN_AMT_DOWN_PAYMENT | PREV_APPL_MEAN_HOUR_APPR_PROCESS_START | PREV_APPL_MEAN_NFLAG_LAST_APPL_IN_DAY | PREV_APPL_MEAN_RATE_DOWN_PAYMENT | PREV_APPL_MEAN_DAYS_DECISION | PREV_APPL_MEAN_SELLERPLACE_AREA | PREV_APPL_MEAN_CNT_PAYMENT | PREV_APPL_MEAN_NFLAG_INSURED_ON_APPROVAL | PREV_APPL_MEAN_CARD_MEAN_MONTHS_BALANCE | PREV_APPL_MEAN_CARD_MEAN_AMT_CREDIT_LIMIT_ACTUAL | PREV_APPL_MEAN_CARD_MEAN_AMT_DRAWINGS_CURRENT | PREV_APPL_MEAN_CARD_MEAN_AMT_INST_MIN_REGULARITY | PREV_APPL_MEAN_CARD_MEAN_AMT_PAYMENT_TOTAL_CURRENT | PREV_APPL_MEAN_CARD_MEAN_AMT_TOTAL_RECEIVABLE | PREV_APPL_MEAN_CARD_MEAN_CNT_DRAWINGS_CURRENT | PREV_APPL_MEAN_CARD_MEAN_CNT_INSTALMENT_MATURE_CUM | PREV_APPL_MEAN_CARD_MEAN_SK_DPD | PREV_APPL_MEAN_CARD_MEAN_SK_DPD_DEF | PREV_APPL_MEAN_INSTALL_MEAN_NUM_INSTALMENT_VERSION | PREV_APPL_MEAN_INSTALL_MEAN_NUM_INSTALMENT_NUMBER | PREV_APPL_MEAN_INSTALL_MEAN_DAYS_INSTALMENT | PREV_APPL_MEAN_INSTALL_MEAN_DAYS_ENTRY_PAYMENT | PREV_APPL_MEAN_INSTALL_MEAN_AMT_INSTALMENT | PREV_APPL_MEAN_INSTALL_MEAN_AMT_PAYMENT | PREV_APPL_MEAN_POS_MEAN_MONTHS_BALANCE | PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT | PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT_FUTURE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 307511 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | Unaccompanied | Working | Higher education | Married | House / apartment | 0.018850 | -19241 | -5170.0 | -812 | 12.061091 | 0 | 0 | 1 | Unknown | 2.0 | 2 | TUESDAY | 18 | 0 | 0 | 0 | 0 | 0 | Kindergarten | 0.752614 | 0.789654 | 0.159520 | 0.06600 | 0.059000 | 0.973200 | 0.752471 | 0.044621 | 0.078942 | 0.137900 | 0.125000 | 0.231894 | 0.066333 | 0.100775 | 0.050500 | 0.008809 | 0.028358 | Unknown | block of flats | Stone, brick | No | 0.0 | 0.0 | 0.0 | -1740.0 | 1 | 0 | 0 | 0.000000 | 0.000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7.000000 | -735.000000 | 0.000000 | 82.428571 | -825.500000 | 5242.425046 | 0.000000 | 207623.571429 | 85240.928571 | 0.000000 | 0.000000 | -93.142857 | 3545.357143 | -11.785714 | 1.0 | 3951.000 | 23787.000 | 2520.0 | 13.0 | 1.0 | 0.104326 | -1740.0 | 23.0 | 8.000000 | 0.000000 | -16.05068 | 222543.30983 | 15300.184003 | 3738.608219 | 11164.747865 | 75101.983900 | 1.704668 | 7.998155 | 3.142326 | 0.027201 | 1.250000 | 2.500000 | -1664.000000 | -1679.500000 | 7312.725000 | 7312.725000 | -55.000000 | 4.000000 | 2.000000 |
| 307512 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.035792 | -18064 | -9118.0 | -1623 | 12.061091 | 0 | 0 | 0 | Low-skill Laborers | 2.0 | 2 | FRIDAY | 9 | 0 | 0 | 0 | 0 | 0 | Self-employed | 0.564990 | 0.291656 | 0.432962 | 0.11744 | 0.088442 | 0.977735 | 0.752471 | 0.044621 | 0.078942 | 0.149725 | 0.226282 | 0.231894 | 0.066333 | 0.100775 | 0.107399 | 0.008809 | 0.028358 | Unknown | Unknown | Unknown | Unknown | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0.000000 | 0.000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 | 3.000000 | -190.666667 | 0.000000 | 439.333333 | -123.000000 | 0.000000 | 0.000000 | 219042.000000 | 189469.500000 | 0.000000 | 0.000000 | -54.333333 | 1420.500000 | -3.000000 | 2.0 | 4813.200 | 20076.750 | 4464.0 | 10.5 | 1.0 | 0.108964 | -536.0 | 18.0 | 12.000000 | 0.000000 | -16.05068 | 222543.30983 | 15300.184003 | 3738.608219 | 11164.747865 | 75101.983900 | 1.704668 | 7.998155 | 3.142326 | 0.027201 | 1.111111 | 5.000000 | -586.000000 | -609.555556 | 6240.205000 | 6240.205000 | -20.000000 | 11.700000 | 7.200000 |
| 307513 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | Unknown | Working | Higher education | Married | House / apartment | 0.019101 | -20038 | -2175.0 | -3503 | 5.000000 | 0 | 0 | 0 | Drivers | 2.0 | 2 | MONDAY | 14 | 0 | 0 | 0 | 0 | 0 | Transport: type 3 | 0.502130 | 0.699787 | 0.610991 | 0.11744 | 0.088442 | 0.977735 | 0.752471 | 0.044621 | 0.078942 | 0.149725 | 0.226282 | 0.231894 | 0.066333 | 0.100775 | 0.107399 | 0.008809 | 0.028358 | Unknown | Unknown | Unknown | Unknown | 0.0 | 0.0 | 0.0 | -856.0 | 0 | 0 | 1 | 0.000000 | 0.000 | 0.000000 | 0.000000 | 1.000000 | 4.000000 | 4.000000 | -1737.500000 | 0.000000 | -1068.000000 | -1054.750000 | 19305.000000 | 0.000000 | 518070.015000 | 0.000000 | 5901.475578 | 0.000000 | -775.500000 | 0.000000 | -28.250000 | 4.0 | 11478.195 | 146134.125 | 3375.0 | 14.5 | 1.0 | 0.067217 | -837.5 | 82.0 | 17.333333 | 0.333333 | -16.05068 | 222543.30983 | 15300.184003 | 3738.608219 | 11164.747865 | 75101.983900 | 1.704668 | 7.998155 | 3.142326 | 0.027201 | 1.050926 | 6.027778 | -854.833333 | -867.592593 | 16349.077917 | 13702.794792 | -28.833333 | 16.648148 | 11.451178 |
| 307514 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.026392 | -13976 | -2000.0 | -4208 | 12.061091 | 0 | 1 | 0 | Sales staff | 4.0 | 2 | WEDNESDAY | 11 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.525734 | 0.509677 | 0.612704 | 0.30520 | 0.197400 | 0.997000 | 0.959200 | 0.116500 | 0.320000 | 0.275900 | 0.375000 | 0.041700 | 0.204200 | 0.240400 | 0.367300 | 0.038600 | 0.080000 | reg oper account | block of flats | Panel | No | 0.0 | 0.0 | 0.0 | -1805.0 | 1 | 0 | 0 | 0.000000 | 0.000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 | 12.000000 | -1401.750000 | 0.000000 | 2387.700000 | -1238.285714 | 0.000000 | 0.000000 | 126739.590000 | 18630.450000 | 14484.394286 | 0.000000 | -651.500000 | 3012.010714 | -22.833333 | 5.0 | 8091.585 | 92920.500 | 3750.0 | 10.8 | 1.0 | 0.057708 | -1124.2 | 1409.6 | 11.333333 | 0.000000 | -25.00000 | 225000.00000 | 6156.400408 | 6133.363929 | 5606.152347 | 7968.609184 | 2.387755 | 19.547619 | 0.000000 | 0.000000 | 1.038889 | 17.595238 | -944.964286 | -949.814286 | 7836.897982 | 7557.738339 | -35.250000 | 15.928571 | 8.312500 |
| 307515 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.010032 | -13040 | -4000.0 | -4262 | 16.000000 | 1 | 0 | 0 | Unknown | 3.0 | 2 | FRIDAY | 5 | 0 | 0 | 0 | 1 | 1 | Business Entity Type 3 | 0.202145 | 0.425687 | 0.510853 | 0.11744 | 0.088442 | 0.977735 | 0.752471 | 0.044621 | 0.078942 | 0.149725 | 0.226282 | 0.231894 | 0.066333 | 0.100775 | 0.107399 | 0.008809 | 0.028358 | Unknown | Unknown | Unknown | Unknown | 0.0 | 0.0 | 0.0 | -821.0 | 1 | 0 | 0 | 0.006402 | 0.007 | 0.034362 | 0.267395 | 0.265474 | 1.899974 | 5.561196 | -1083.047110 | 1.035863 | 651.807511 | -970.304531 | 5242.425046 | 0.007919 | 378080.200789 | 160390.076973 | 5901.475578 | 49.549302 | -546.632499 | 16052.247330 | -20.984805 | 2.0 | 17782.155 | 300550.500 | 8095.5 | 5.5 | 1.0 | 0.087554 | -466.0 | 13.0 | 24.000000 | 0.000000 | -16.05068 | 222543.30983 | 15300.184003 | 3738.608219 | 11164.747865 | 75101.983900 | 1.704668 | 7.998155 | 3.142326 | 0.027201 | 1.000000 | 6.500000 | -622.000000 | -634.250000 | 11100.337500 | 11100.337500 | -21.000000 | 12.000000 | 5.846154 |
train_domain = train.copy()
# INCOME_CREDIT_PERCENT: the percentage of the income relative to a client's credit amount
train_domain['INCOME_CREDIT_PERCENT'] = train_domain['AMT_INCOME_TOTAL'] / train_domain['AMT_CREDIT']
# ANNUITY_INCOME_PERCENT: the percentage of the loan annuity relative to a client's income
train_domain['ANNUITY_INCOME_PERCENT'] = train_domain['AMT_ANNUITY'] / train_domain['AMT_INCOME_TOTAL']
# CREDIT_TERM: the annuity relative to the credit amount; its reciprocal approximates the length of the payment in months (since the annuity is the monthly amount due)
train_domain['CREDIT_TERM'] = train_domain['AMT_ANNUITY'] / train_domain['AMT_CREDIT']
# INCOME_PER_PERSON: Income per person in a family
train_domain['INCOME_PER_PERSON'] = train_domain['AMT_INCOME_TOTAL'] / train_domain['CNT_FAM_MEMBERS']
# CNT_ADULT_FAM_MEMBER: number of adult members in a family
train_domain['CNT_ADULT_FAM_MEMBER'] = train_domain['CNT_FAM_MEMBERS'] - train_domain['CNT_CHILDREN']
# RATIO_CHILDREN_TO_ADULTS: number of children per adult in the family
train_domain['RATIO_CHILDREN_TO_ADULTS'] = train_domain['CNT_CHILDREN'] / train_domain['CNT_ADULT_FAM_MEMBER']
# RATIO_AMT_CREDIT_TO_CNT_FAM_MEMBERS: the credit amount per person in the family
train_domain['RATIO_AMT_CREDIT_TO_CNT_FAM_MEMBERS'] = train_domain['AMT_CREDIT'] / train_domain['CNT_FAM_MEMBERS']
# RATIO_AMT_CREDIT_TO_CNT_ADULT_FAM_MEMBER: the credit amount per adult in the family
train_domain['RATIO_AMT_CREDIT_TO_CNT_ADULT_FAM_MEMBER'] = train_domain['AMT_CREDIT'] / train_domain['CNT_ADULT_FAM_MEMBER']
# AMT_INCOME_TOTAL_PER_ADULT_FAM_MEMBER: the income per adult in the family
train_domain['AMT_INCOME_TOTAL_PER_ADULT_FAM_MEMBER'] = train_domain['AMT_INCOME_TOTAL'] / train_domain['CNT_ADULT_FAM_MEMBER']
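The ratio features above can be checked on a tiny hand-made frame. The sketch below uses hypothetical applicants (the column names follow the notebook, the values do not come from the dataset). It also shows one possible way to handle a zero adult count, should such a row occur: treating it as missing instead of letting the division produce `inf`. That guard is an assumption of this sketch, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy applicants; column names follow the notebook
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [200000.0, 90000.0],
    "AMT_CREDIT": [400000.0, 450000.0],
    "AMT_ANNUITY": [20000.0, 22500.0],
    "CNT_FAM_MEMBERS": [2.0, 1.0],
    "CNT_CHILDREN": [1, 1],
})

toy["ANNUITY_INCOME_PERCENT"] = toy["AMT_ANNUITY"] / toy["AMT_INCOME_TOTAL"]
toy["CNT_ADULT_FAM_MEMBER"] = toy["CNT_FAM_MEMBERS"] - toy["CNT_CHILDREN"]

# Assumption: a zero adult count is treated as missing rather than producing inf
adults = toy["CNT_ADULT_FAM_MEMBER"].replace(0, np.nan)
toy["RATIO_CHILDREN_TO_ADULTS"] = toy["CNT_CHILDREN"] / adults

print(toy[["ANNUITY_INCOME_PERCENT", "CNT_ADULT_FAM_MEMBER", "RATIO_CHILDREN_TO_ADULTS"]])
```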
DAYS_LAST_PHONE_CHANGE
train['DAYS_LAST_PHONE_CHANGE'].describe()
count    218646.000000
mean       -933.974017
std         819.060879
min       -4292.000000
25%       -1544.000000
50%        -723.000000
75%        -248.000000
max           0.000000
Name: DAYS_LAST_PHONE_CHANGE, dtype: float64
The values in DAYS_LAST_PHONE_CHANGE are negative because they are recorded relative to the application date and refer to a point in the past. For example, if someone applies in 2022 and last changed their phone number in 2010, the value is stored as (2010 - 2022) * 365 = -4380.
We convert DAYS_LAST_PHONE_CHANGE to years by dividing by -365 days, which gives a new, non-negative feature.
train['YEARS_LAST_PHONE_CHANGE'] = train['DAYS_LAST_PHONE_CHANGE'] / (-365)
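As a quick check of the conversion, here is a minimal sketch on hypothetical values (not taken from the dataset), including the -4380 days from the example above:

```python
import pandas as pd

# Hypothetical values: days relative to the application date (negative = in the past)
days = pd.Series([-4380, -730, -365], name="DAYS_LAST_PHONE_CHANGE")

# Dividing by -365 flips the sign and converts days to (approximate) years
years = days / -365
print(years.tolist())
```

The division ignores leap years, so the result is an approximation of the elapsed time in years.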
DELINQUENCIES
It is very important to see how many times a client was late with payments or defaulted on loans. We suppose that information about the client's social circle is also important. We divide the values into 3 groups: 0, 1, and more than 1.
train.loc[train['OBS_60_CNT_SOCIAL_CIRCLE'] > 1, 'OBS_60_CNT_SOCIAL_CIRCLE'] = '1+'
train.loc[train['DEF_60_CNT_SOCIAL_CIRCLE'] > 1, 'DEF_60_CNT_SOCIAL_CIRCLE'] = '1+'
train.loc[train['DEF_30_CNT_SOCIAL_CIRCLE'] > 1, 'DEF_30_CNT_SOCIAL_CIRCLE'] = '1+'
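One possible alternative for building the three groups (0, 1, more than 1) is `pd.cut`, which returns a single categorical column. This is a sketch on hypothetical data and not the notebook's method: the cell above only overwrites values greater than 1 with the string '1+' and leaves 0 and 1 as numbers, so the column ends up with mixed types.

```python
import numpy as np
import pandas as pd

# Hypothetical counts of defaults observed in a client's social circle
counts = pd.Series([0.0, 1.0, 2.0, 5.0, 0.0])

# Bins are right-closed: (-inf, 0] -> '0', (0, 1] -> '1', (1, inf] -> '1+'
groups = pd.cut(counts, bins=[-np.inf, 0, 1, np.inf], labels=["0", "1", "1+"])
print(groups.value_counts().to_dict())
```

A categorical column with string labels is usually easier to pass to plotting functions such as `sns.countplot` than a column that mixes floats and strings.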
EXT_SOURCE
# Based on EXT_SOURCES features: calculate mean, max, sum, min, median
train['EXT_SOURCE_mean'] = train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis = 1)
train['EXT_SOURCES_MAX'] = train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].max(axis=1)
train['EXT_SOURCES_SUM'] = train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].sum(axis=1)
train['EXT_SOURCES_MIN'] = train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].min(axis=1)
train['EXT_SOURCES_MEDIAN'] = train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].median(axis=1)
The values in PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE, PREV_BUR_MEAN_DAYS_CREDIT and PREV_BUR_MEAN_DAYS_ENDDATE_FACT are negative because they are recorded relative to the application date and refer to a point in the past. For example, if someone applies in 2022 and the credit ended in 2010, the value is stored as (2010 - 2022) * 365 = -4380.
We convert PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE, PREV_BUR_MEAN_DAYS_CREDIT and PREV_BUR_MEAN_DAYS_ENDDATE_FACT to years by dividing by -365 days.
train['PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE'] = train['PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE'] / -365
train['PREV_BUR_MEAN_DAYS_CREDIT'] = train['PREV_BUR_MEAN_DAYS_CREDIT'] / -365
train['PREV_BUR_MEAN_DAYS_ENDDATE_FACT'] = train['PREV_BUR_MEAN_DAYS_ENDDATE_FACT'] / -365
# Replace a zero credit sum with -1 to avoid division by zero (those rows then get a ratio of -annuity)
temp = train['PREV_BUR_MEAN_AMT_CREDIT_SUM'].replace(0,-1)
# CREDIT_TO_ANNUITY_RATIO: despite its name, the mean bureau annuity relative to the client's mean bureau credit sum
train['CREDIT_TO_ANNUITY_RATIO'] = train['PREV_BUR_MEAN_AMT_ANNUITY'] / temp
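The effect of the zero-denominator replacement can be seen on a two-row toy frame. The sketch below uses hypothetical values; the second variant, which marks the ratio as missing, is an alternative offered here as an assumption and is not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical values; column names follow the notebook
toy = pd.DataFrame({
    "PREV_BUR_MEAN_AMT_ANNUITY": [1000.0, 500.0],
    "PREV_BUR_MEAN_AMT_CREDIT_SUM": [20000.0, 0.0],
})

# Notebook approach: a zero denominator becomes -1, so the ratio flips sign
denom_minus_one = toy["PREV_BUR_MEAN_AMT_CREDIT_SUM"].replace(0, -1)
ratio_minus_one = toy["PREV_BUR_MEAN_AMT_ANNUITY"] / denom_minus_one

# Alternative (assumption of this sketch): mark the ratio as missing instead
denom_nan = toy["PREV_BUR_MEAN_AMT_CREDIT_SUM"].replace(0, np.nan)
ratio_nan = toy["PREV_BUR_MEAN_AMT_ANNUITY"] / denom_nan

print(ratio_minus_one.tolist(), ratio_nan.tolist())
```

With the -1 replacement the affected rows form a separate negative cluster in the distribution; with NaN they would need imputation or a model that accepts missing values.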
# Previous
# PREV_APPL_MEAN_DIFF_AMT_DOWN_PAYMENT_AMT_ANNUITY:
# The average balance after paying the down payment and receiving the annuity of the previous applications (annuity minus down payment)
# If positive: there is a balance
# If negative: there is no balance
train['PREV_APPL_MEAN_DIFF_AMT_DOWN_PAYMENT_AMT_ANNUITY'] = (train['PREV_APPL_MEAN_AMT_ANNUITY']
- train['PREV_APPL_MEAN_AMT_DOWN_PAYMENT'])
# Credit_card
# PREV_APPL_MEAN_CARD_MEAN_PCTG_RECEIVABLE_TOTAL_CURRENT:
# The average credit amount that the client has not paid during the month on the previous credit (total receivable minus total payment; a difference, not a percentage, despite the name)
train['PREV_APPL_MEAN_CARD_MEAN_PCTG_RECEIVABLE_TOTAL_CURRENT'] = (train['PREV_APPL_MEAN_CARD_MEAN_AMT_TOTAL_RECEIVABLE']
- train['PREV_APPL_MEAN_CARD_MEAN_AMT_PAYMENT_TOTAL_CURRENT'])
# Installment
# PREV_APPL_MEAN_INSTALL_MEAN_DIFF_PAYMENT:
# The difference between the average required installment amount and the average amount actually paid
# (the amount that the client has not paid)
train['PREV_APPL_MEAN_INSTALL_MEAN_DIFF_PAYMENT'] = (train['PREV_APPL_MEAN_INSTALL_MEAN_AMT_INSTALMENT']
- train['PREV_APPL_MEAN_INSTALL_MEAN_AMT_PAYMENT'])
# PREV_APPL_MEAN_INSTALL_MEAN_DAYS_BEFORE_DUE: how many days before the due date the payment was made
# If positive: the average number of days by which payments were early
# If negative: the average number of days by which payments were late
train['PREV_APPL_MEAN_INSTALL_MEAN_DAYS_BEFORE_DUE'] = (train['PREV_APPL_MEAN_INSTALL_MEAN_DAYS_INSTALMENT']
- train['PREV_APPL_MEAN_INSTALL_MEAN_DAYS_ENTRY_PAYMENT'])
# POS cash
# PREV_APPL_MEAN_POS_MEAN_PCTG_INSTALMENT_FUTURE_INSTALMENT:
# The share of installments still to be paid on the previous credit (future installments / total installments)
train['PREV_APPL_MEAN_POS_MEAN_PCTG_INSTALMENT_FUTURE_INSTALMENT'] = (train['PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT_FUTURE']
/ train['PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT'])
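To make the sign conventions concrete, the following sketch recomputes two of these features by hand for the first row of the `train.head()` output shown earlier (SK_ID_CURR 100002). The variable names are hypothetical; only the four input values are copied from that table.

```python
# Values copied from the first row of train.head() shown earlier
days_instalment = -295.0          # PREV_APPL_MEAN_INSTALL_MEAN_DAYS_INSTALMENT
days_entry_payment = -315.421053  # PREV_APPL_MEAN_INSTALL_MEAN_DAYS_ENTRY_PAYMENT
cnt_instalment = 24.0             # PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT
cnt_instalment_future = 15.0      # PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT_FUTURE

# Positive result: payments were entered, on average, before the due date
days_before_due = days_instalment - days_entry_payment

# Share of installments that were still to be paid
future_share = cnt_instalment_future / cnt_instalment

print(round(days_before_due, 2), future_share)
```

Both DAYS columns are negative offsets from the application date, so a more negative entry date means an earlier payment, which is why the difference comes out positive for this client.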
train.columns
Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
'AMT_CREDIT', 'AMT_ANNUITY',
...
'EXT_SOURCES_MAX', 'EXT_SOURCES_SUM', 'EXT_SOURCES_MIN',
'EXT_SOURCES_MEDIAN', 'CREDIT_TO_ANNUITY_RATIO',
'PREV_APPL_MEAN_DIFF_AMT_DOWN_PAYMENT_AMT_ANNUITY',
'PREV_APPL_MEAN_CARD_MEAN_PCTG_RECEIVABLE_TOTAL_CURRENT',
'PREV_APPL_MEAN_INSTALL_MEAN_DIFF_PAYMENT',
'PREV_APPL_MEAN_INSTALL_MEAN_DAYS_BEFORE_DUE',
'PREV_APPL_MEAN_POS_MEAN_PCTG_INSTALMENT_FUTURE_INSTALMENT'],
dtype='object', length=124)
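def distribution_domain(x):
    # Plot the distribution of a domain feature from train_domain
    # Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is the suggested replacement
    plot = sns.distplot(train_domain[x])
    plt.title(x)
    plt.show()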
def pie_chart(x):
    temp = train[x].value_counts(normalize=True)
    df = pd.DataFrame({'labels': temp.index,
                       'values': temp.values
                       })
    plt.pie(temp, labels=df['labels'], autopct='%.f%%', startangle=90)
    #plt.legend()
    plt.title("{}".format(x))
    plt.show()
def pie_chart_n(x):
    temp = train[x].value_counts(normalize=True)
    df = pd.DataFrame({'labels': temp.index,
                       'values': temp.values
                       })
    fig = px.pie(df, values=temp.values, names=temp.index, title=x)
    fig.show()
def pie_chart_circle(x):
    temp = train[x].value_counts(normalize=True)
    df = pd.DataFrame({'labels': temp.index,
                       'values': temp.values})
    fig1, ax1 = plt.subplots()
    ax1.pie(temp, labels=df['labels'], autopct='%1.1f%%', startangle=90)
    # Draw a white circle in the centre to turn the pie into a donut chart
    centre_circle = plt.Circle((0,0),0.80,fc='white')
    fig = plt.gcf()
    fig.gca().add_artist(centre_circle)
    # Equal aspect ratio ensures that pie is drawn as a circle
    ax1.axis('equal')
    plt.title("{}".format(x))
    #plt.legend()
    plt.tight_layout()
    plt.show()
def bar_chart(x):
    # Count of observations per category
    sns.set(style="whitegrid")
    ax = sns.countplot(x=x, data=train)
    plt.xticks(rotation=90)
    plt.show()
def bar_chart_pct(x):
    # Percentage of observations per category
    sns.set(style="whitegrid")
    ax = sns.histplot(train, x=x, stat="percent", multiple="dodge", shrink=.8)
    plt.xticks(rotation=90)
    plt.show()
def distribution(x):
    # Histogram with a KDE overlay for a feature in train
    plot = sns.distplot(train[x])
    plt.title(x)
    plt.show()
domain_features = ['INCOME_CREDIT_PERCENT', 'ANNUITY_INCOME_PERCENT', 'CREDIT_TERM',
'INCOME_PER_PERSON', 'CNT_ADULT_FAM_MEMBER', 'RATIO_CHILDREN_TO_ADULTS',
'RATIO_AMT_CREDIT_TO_CNT_FAM_MEMBERS',
'RATIO_AMT_CREDIT_TO_CNT_ADULT_FAM_MEMBER',
'AMT_INCOME_TOTAL_PER_ADULT_FAM_MEMBER']
distribution_domain('INCOME_CREDIT_PERCENT')
Most people have enough income to pay for their credit. Some even have an income 8 times larger than the loan amount.
distribution_domain('ANNUITY_INCOME_PERCENT')
The most frequent ratio of annuity to income is below 25%. That is, most clients spend up to about a quarter of their income on repaying the debt.
distribution_domain("CREDIT_TERM")
The credit term ranges from 0.02 to 0.12 and mostly falls at around 0.05.
distribution_domain("INCOME_PER_PERSON")
Income per person is the client's income divided by the number of family members. The most frequent income per person is below 100000.
distribution_domain("CNT_ADULT_FAM_MEMBER")
The adult members of the family are the ones who can earn money to repay the loan. Most families have 1 to 2 adults.
distribution_domain("RATIO_CHILDREN_TO_ADULTS")
Children are dependents of the adults. The higher this ratio, the heavier the burden on the adults.
distribution_domain('RATIO_AMT_CREDIT_TO_CNT_FAM_MEMBERS')
distribution_domain('RATIO_AMT_CREDIT_TO_CNT_ADULT_FAM_MEMBER')
The credit amount per person in the family mostly falls at around 250000.
distribution_domain('AMT_INCOME_TOTAL_PER_ADULT_FAM_MEMBER')
pie_chart("TARGET")
As we can see, the data is highly imbalanced. Most clients are able to repay the debt.
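The class balance can also be checked numerically rather than read off the pie chart. Here is a minimal sketch on a made-up target column; the 92% / 8% split is a hypothetical illustration, not a figure reported by this notebook.

```python
import pandas as pd

# Hypothetical target column: 23 zeros and 2 ones.
target = pd.Series([0] * 23 + [1] * 2, name="TARGET")

# Relative frequency of each class.
share = target.value_counts(normalize=True)
print(share)
```

On the real data the same call would be `train["TARGET"].value_counts(normalize=True)`.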
pie_chart("CODE_GENDER")
About 66.8% of the clients are women.
train["AGE"] = np.abs(train["DAYS_BIRTH"] / 365)
train["AGE"]
0 25.920548
2 52.180822
4 54.608219
5 46.413699
6 37.747945
...
307501 35.509589
307502 44.008219
307505 66.805479
307507 56.917808
307509 32.769863
Name: AGE, Length: 218646, dtype: float64
distribution("AGE")
Clients are between 20 and 70 years old, and those aged 35 to 45 are the most common borrowers. In general, demand for loans rises through the early working years, dips slightly between 40 and 50, and then rises again from just over 50 until retirement.
pie_chart("NAME_CONTRACT_TYPE")
Most of the loans taken by applicants are cash loans: they account for about 90% of all loans.
pie_chart_circle("FLAG_OWN_CAR")
Only 32% of clients own (at least) one car, which can serve as good collateral.
pie_chart_circle("FLAG_OWN_REALTY")
Fortunately, nearly 70% of clients own collateral such as a house or an apartment.
pie_chart_n("CNT_CHILDREN")
About 70% of clients have no children, and about 20% have only one child.
pie_chart_n("CNT_FAM_MEMBERS")
More than half of the clients live in a two-member family.
pie_chart_n("NAME_INCOME_TYPE")
bar_chart("OCCUPATION_TYPE")
Most clients are laborers, sales staff or core staff who were employed at the time of application. However, the occupation of many clients is recorded as "Unknown".
pie_chart_n("NAME_EDUCATION_TYPE")
Nearly 75% of clients have the education level "Secondary / secondary special".
pie_chart_n("NAME_HOUSING_TYPE")
Most clients live in a house or an apartment.
bar_chart_pct("NAME_FAMILY_STATUS")
bar_chart_pct("NAME_TYPE_SUITE")
Most clients are married, but they were unaccompanied when they applied for the loan.
for each in ["AMT_INCOME_TOTAL", "AMT_CREDIT", "AMT_ANNUITY"]:
    distribution(each)
The distributions of credit amount, annuity and income look broadly similar in shape.
pie_chart_circle("REGION_RATING_CLIENT_W_CITY")
Most clients come from a region with rating 2.
distribution("REGION_POPULATION_RELATIVE")
Most clients live in areas that are not densely populated.
bar_chart("WEEKDAY_APPR_PROCESS_START")
On weekends, people are less likely to start the application process.
bar_chart("HOUR_APPR_PROCESS_START")
Clients most often start the process at around 10 AM; after that, the later the hour, the fewer applications are started.
pie_chart_circle("REG_REGION_NOT_WORK_REGION")
Flag indicating whether the client's permanent address differs from the work address (1 = different, 0 = same, at region level). Most clients have the same work address as their permanent address.
pie_chart_circle("LIVE_REGION_NOT_WORK_REGION")
Most clients have the same work address as their living address.
pie_chart('REG_CITY_NOT_LIVE_CITY')
pie_chart('REG_CITY_NOT_WORK_CITY')
pie_chart('LIVE_CITY_NOT_WORK_CITY')
For most clients, the contact address and the living address are the same. About 20% of clients have a work address that differs from their contact address or their living address, so roughly a fifth to a quarter of clients work in a city other than the one where they live.
fig = px.pie(train, values=train['FONDKAPREMONT_MODE'].value_counts(),
names=train['FONDKAPREMONT_MODE'].value_counts().index, title='FONDKAPREMONT_MODE', width=500, height=500)
fig.show()
fig = px.pie(train, values=train['HOUSETYPE_MODE'].value_counts(),
names=train['HOUSETYPE_MODE'].value_counts().index, title='HOUSETYPE_MODE', width=450, height=450)
fig.show()
fig = px.pie(train, values=train['EMERGENCYSTATE_MODE'].value_counts(),
names=train['EMERGENCYSTATE_MODE'].value_counts().index, title='EMERGENCYSTATE_MODE', width=410, height=410)
fig.show()
Data about the client's housing (EMERGENCYSTATE_MODE, HOUSETYPE_MODE, FONDKAPREMONT_MODE) is largely missing: unknown values account for more than half of the records.
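The share of unknown values can be computed directly for each column. The sketch below counts both real missing values and an "Unknown" placeholder label on a made-up frame; that missing categories were filled with the string "Unknown" is an assumption, and the toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset.
toy = pd.DataFrame({
    "HOUSETYPE_MODE": ["Unknown", "block of flats", "Unknown", "Unknown"],
    "EMERGENCYSTATE_MODE": ["No", "Unknown", "Unknown", None],
})

# A cell is "unknown" if it is missing or carries the placeholder label.
unknown_share = (toy.isna() | toy.eq("Unknown")).mean()
print(unknown_share)
```

Columns whose unknown share exceeds some chosen threshold could then be flagged for removal or special handling.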
# One wide figure; each subplot shows the counts of one social-circle feature
plt.figure(figsize=(30, 9))
plt.subplot(1, 4, 1)
sns.countplot(x=train['DEF_60_CNT_SOCIAL_CIRCLE'])
plt.subplot(1, 4, 2)
sns.countplot(x=train['DEF_30_CNT_SOCIAL_CIRCLE'])
plt.subplot(1, 4, 3)
sns.countplot(x=train['OBS_60_CNT_SOCIAL_CIRCLE'])
The number of people with a late payment (more than 1 day) appears to be much higher in the 60-day observations.
pie_chart('FLAG_DOCUMENT_3')
About 71% of clients provided document 3.
pie_chart('FLAG_DOCUMENT_6')
pie_chart('FLAG_DOCUMENT_8')
More than 90% of clients did not provide documents 6 and 8.
pie_chart_n('AMT_REQ_CREDIT_BUREAU_HOUR')
pie_chart_n('AMT_REQ_CREDIT_BUREAU_DAY')
pie_chart_n('AMT_REQ_CREDIT_BUREAU_WEEK')
pie_chart_n('AMT_REQ_CREDIT_BUREAU_MON')
pie_chart_n('AMT_REQ_CREDIT_BUREAU_QRT')
pie_chart_n('AMT_REQ_CREDIT_BUREAU_YEAR')
Typically, more than 80% of clients have no enquiries in the hour, day or week before the application. Over longer windows (a month up to a year), there are more enquiries.
pie_chart_circle('PREV_BUR_MEAN_CNT_CREDIT_PROLONG')
About 17% of clients prolonged a credit in the past.
distribution('PREVIOUS_LOANS_COUNT')
Most people have had at least one previous loan (most often around 5).
distribution("PREV_APPL_MEAN_INSTALL_MEAN_DAYS_INSTALMENT")
distribution("PREV_APPL_MEAN_INSTALL_MEAN_DAYS_ENTRY_PAYMENT")
PREV_APPL_MEAN_INSTALL_MEAN_DAYS_INSTALMENT: The day on which the installment of the previous credit was due peaks at around 850 days ago, and most values lie between 100 and 2000 days ago.
PREV_APPL_MEAN_INSTALL_MEAN_DAYS_ENTRY_PAYMENT: The day on which the installments of the previous credit were actually paid follows the same pattern as PREV_APPL_MEAN_INSTALL_MEAN_DAYS_INSTALMENT.
=> Both features have the same distribution, and it is skewed to the right.
distribution("PREV_APPL_MEAN_INSTALL_MEAN_AMT_INSTALMENT")
distribution("PREV_APPL_MEAN_INSTALL_MEAN_AMT_PAYMENT")
=> Both of these features have the same distribution and it is skewed to the left
distribution('PREV_APPL_MEAN_DIFF_AMT_DOWN_PAYMENT_AMT_ANNUITY')
The average balance after paying the down payment and receiving the annuity of each person's previous applications lies mostly between 0 and 20000. Many clients are left with a balance after these transactions, most often around 6000 to 7000.
distribution('PREV_APPL_MEAN_CARD_MEAN_PCTG_RECEIVABLE_TOTAL_CURRENT')
The average total amount of credit that the client has not paid during the month on the previous credit is mostly above 50000 (around 65000). A second group of clients pays in full (around 0).
(The amount that client hasn't paid)
distribution('PREV_APPL_MEAN_INSTALL_MEAN_DIFF_PAYMENT')
The unpaid installment amount is mostly around 0: the average required payment and the amount actually paid differ very little and are quite balanced.
(The percentage of the required installments value that wasn't paid.)
distribution('PREV_APPL_MEAN_POS_MEAN_PCTG_INSTALMENT_FUTURE_INSTALMENT')
The fraction of the installments that was not paid is mostly between 0.5 and 0.7 (50-70%). Most clients have paid only about half.
distribution('PREV_APPL_MEAN_INSTALL_MEAN_DAYS_BEFORE_DUE')
The average number of days by which installment payments were made early is mostly positive, which shows that many clients pay before the due date or on time. These clients do not appear to have serious financial difficulties.
In contrast, a small share of values is negative, which shows that a small number of clients pay late. These clients may have financial difficulties, or other factors may be causing the delay.
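The sign convention behind this reading can be checked on a few made-up rows. Both day columns count days relative to the current application (negative values lie in the past), so subtracting the payment day from the due day gives a positive number for early payments. The toy values below are hypothetical.

```python
import pandas as pd

# Hypothetical installments: due day and actual payment day,
# both measured in days relative to the application date.
toy = pd.DataFrame({
    "DAYS_INSTALMENT": [-100.0, -50.0, -30.0],
    "DAYS_ENTRY_PAYMENT": [-105.0, -50.0, -20.0],
})

# Positive: paid early. Zero: paid on the due day. Negative: paid late.
toy["DAYS_BEFORE_DUE"] = toy["DAYS_INSTALMENT"] - toy["DAYS_ENTRY_PAYMENT"]
print(toy)
```

The first row was paid 5 days early, the second on the due day, and the third 10 days late.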
Target variable (0 - repaid loans, 1 - loans that were not repaid)
We divide the columns into 4 groups to make the analysis easier. We mostly use KDE plots to compare the distribution of each feature of the train data between TARGET = 0 and TARGET = 1.
numeric_columns = train.select_dtypes(exclude="object").drop(["TARGET", "SK_ID_CURR"], axis = 1).columns[:17]
numeric_columns = list(numeric_columns)
numeric_columns.append('AGE')
numeric_columns
['CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_WORK_PHONE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'AGE']
def kde_target(var_name, df):
    # Calculate the correlation coefficient between the variable and the target
    corr = df['TARGET'].corr(df[var_name])
    # Calculate medians for repaid vs not repaid
    avg_repaid = df[df['TARGET'] == 0][var_name].median()
    avg_not_repaid = df[df['TARGET'] == 1][var_name].median()
    plt.figure(figsize = (12, 6))
    # Plot the distribution for target == 0 and target == 1
    sns.kdeplot(df[df['TARGET'] == 0][var_name], label = 'TARGET == 0')
    sns.kdeplot(df[df['TARGET'] == 1][var_name], label = 'TARGET == 1')
    # Label the plot
    plt.xlabel(var_name); plt.ylabel('Density'); plt.title('%s Distribution' % var_name)
    plt.legend()
    plt.show()
    # Print out the correlation
    print('The correlation between %s and the TARGET is %0.4f' % (var_name, corr))
    # Print out the median values
    print('Median value for loan that was not repaid = %0.4f' % avg_not_repaid)
    print('Median value for loan that was repaid = %0.4f' % avg_repaid)
for col in numeric_columns:
    kde_target(col, train)
The correlation between CNT_CHILDREN and the TARGET is 0.0199 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between AMT_INCOME_TOTAL and the TARGET is -0.0182 Median value for loan that was not repaid = 135000.0000 Median value for loan that was repaid = 135000.0000
The correlation between AMT_CREDIT and the TARGET is -0.0249 Median value for loan that was not repaid = 484724.2500 Median value for loan that was repaid = 497520.0000
The correlation between AMT_ANNUITY and the TARGET is -0.0053 Median value for loan that was not repaid = 24853.5000 Median value for loan that was repaid = 24174.0000
The correlation between REGION_POPULATION_RELATIVE and the TARGET is -0.0332 Median value for loan that was not repaid = 0.0186 Median value for loan that was repaid = 0.0188
The correlation between DAYS_BIRTH and the TARGET is 0.0805 Median value for loan that was not repaid = -14253.5000 Median value for loan that was repaid = -15981.0000
The correlation between DAYS_REGISTRATION and the TARGET is 0.0454 Median value for loan that was not repaid = -4055.0000 Median value for loan that was repaid = -4596.0000
The correlation between DAYS_ID_PUBLISH and the TARGET is 0.0527 Median value for loan that was not repaid = -2801.0000 Median value for loan that was repaid = -3315.0000
The correlation between OWN_CAR_AGE and the TARGET is 0.0298 Median value for loan that was not repaid = 12.0611 Median value for loan that was repaid = 12.0611
The correlation between FLAG_WORK_PHONE and the TARGET is 0.0306 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between FLAG_PHONE and the TARGET is -0.0243 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between FLAG_EMAIL and the TARGET is -0.0003 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between CNT_FAM_MEMBERS and the TARGET is 0.0099 Median value for loan that was not repaid = 2.0000 Median value for loan that was repaid = 2.0000
The correlation between REGION_RATING_CLIENT_W_CITY and the TARGET is 0.0579 Median value for loan that was not repaid = 2.0000 Median value for loan that was repaid = 2.0000
The correlation between HOUR_APPR_PROCESS_START and the TARGET is -0.0231 Median value for loan that was not repaid = 12.0000 Median value for loan that was repaid = 12.0000
The correlation between REG_REGION_NOT_WORK_REGION and the TARGET is 0.0091 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between LIVE_REGION_NOT_WORK_REGION and the TARGET is 0.0054 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between AGE and the TARGET is -0.0805 Median value for loan that was not repaid = 39.0507 Median value for loan that was repaid = 43.7836
Almost none of these features has a visible effect on the target. Only "AGE" (or "DAYS_BIRTH") has a small effect on TARGET.
Clients aged 30 to 40 have the most difficulty repaying the debt. Beyond that, the older the client, the more likely the debt is repaid. We will dig deeper into this.
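The correlations reported for DAYS_BIRTH (0.0805) and AGE (-0.0805) are mirror images because AGE is a negatively scaled copy of DAYS_BIRTH, and Pearson correlation only changes sign under such a transformation. A minimal sketch with randomly generated, hypothetical data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: negative day counts and a random binary target.
days_birth = pd.Series(-rng.integers(20 * 365, 70 * 365, size=1000))
target = pd.Series(rng.integers(0, 2, size=1000))

# Same transformation as in the notebook.
age = np.abs(days_birth / 365)

corr_days = target.corr(days_birth)
corr_age = target.corr(age)
print(corr_days, corr_age)
```

The two printed values have the same magnitude and opposite signs, so the two columns carry the same information about the target.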
# Age information into a separate dataframe
age_data = abs(train[['TARGET', 'DAYS_BIRTH']])
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'] / 365
# Bin the age data
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins = np.linspace(20, 70, num = 11))
age_data.head(10)
# Group by the bin and calculate averages
age_groups = age_data.groupby('YEARS_BINNED').mean()
age_groups
plt.figure(figsize = (8, 8))
# Graph the age bins and the average of the target as a bar plot
plt.bar(age_groups.index.astype(str), 100 * age_groups['TARGET'])
# Plot labeling
plt.xticks(rotation = 75); plt.xlabel('Age Group (years)'); plt.ylabel('Failure to Repay (%)')
plt.title('Failure to Repay by Age Group');
There is a clear trend: younger applicants are more likely not to repay the loan! The rate of failure to repay is above 10% for the youngest three age groups and below 5% for the oldest age group.
This is information that could be directly used by the bank: because younger clients are less likely to repay the loan, maybe they should be provided with more guidance or financial planning tips. This does not mean the bank should discriminate against younger clients, but it would be smart to take precautionary measures to help younger clients pay on time.
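The binning and averaging step can be reproduced on a small made-up sample to see how the failure rate per age group is obtained. Since TARGET is 0 or 1, its mean within a bin is the share of loans that were not repaid. The rows below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: four young and four old applicants.
toy = pd.DataFrame({
    "TARGET": [1, 1, 0, 0, 0, 0, 0, 1],
    "DAYS_BIRTH": [-8000, -9000, -8500, -8200,
                   -24000, -25000, -24500, -24200],
})

toy["YEARS_BIRTH"] = toy["DAYS_BIRTH"].abs() / 365
# Five-year bins from 20 to 70, as in the notebook.
toy["YEARS_BINNED"] = pd.cut(toy["YEARS_BIRTH"],
                             bins=np.linspace(20, 70, num=11))

# Mean of a 0/1 target = failure rate; observed=True drops empty bins.
rate = toy.groupby("YEARS_BINNED", observed=True)["TARGET"].mean() * 100
print(rate)
```

In this toy sample the (20, 25] bin has a 50% failure rate and the (65, 70] bin has 25%.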
cat_columns = train.select_dtypes("object").columns[:10]
cat_columns
Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE'],
dtype='object')
for feature in cat_columns:
    sns.set(style="whitegrid")
    ax = sns.countplot(x=feature, hue="TARGET", data=train)
    plt.xticks(rotation=90)
    plt.show()
For the categorical features we can draw one conclusion: in every category of every feature, most people are able to repay the debt.
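Because the target is imbalanced, raw counts make every category look alike. A complementary view, shown here as a sketch on hypothetical rows, is the failure rate within each category, which is the mean of TARGET per group.

```python
import pandas as pd

# Hypothetical toy data; the column name comes from the dataset.
toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans"] * 4 + ["Revolving loans"] * 2,
    "TARGET": [0, 0, 0, 1, 0, 0],
})

# count = group size, mean = share of loans that were not repaid.
summary = toy.groupby("NAME_CONTRACT_TYPE")["TARGET"].agg(["count", "mean"])
print(summary)
```

Comparing the `mean` column across categories would show whether some categories default more often, even when repaid loans dominate everywhere.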
for col in domain_features:
    kde_target(col, train_domain)
The correlation between INCOME_CREDIT_PERCENT and the TARGET is -0.0116 Median value for loan that was not repaid = 0.3000 Median value for loan that was repaid = 0.3000
The correlation between ANNUITY_INCOME_PERCENT and the TARGET is 0.0163 Median value for loan that was not repaid = 0.1737 Median value for loan that was repaid = 0.1665
The correlation between CREDIT_TERM and the TARGET is 0.0112 Median value for loan that was not repaid = 0.0500 Median value for loan that was repaid = 0.0500
The correlation between INCOME_PER_PERSON and the TARGET is -0.0120 Median value for loan that was not repaid = 67500.0000 Median value for loan that was repaid = 67500.0000
The correlation between CNT_ADULT_FAM_MEMBER and the TARGET is -0.0118 Median value for loan that was not repaid = 2.0000 Median value for loan that was repaid = 2.0000
The correlation between RATIO_CHILDREN_TO_ADULTS and the TARGET is 0.0222 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between RATIO_AMT_CREDIT_TO_CNT_FAM_MEMBERS and the TARGET is -0.0190 Median value for loan that was not repaid = 236880.0000 Median value for loan that was repaid = 251730.0000
The correlation between RATIO_AMT_CREDIT_TO_CNT_ADULT_FAM_MEMBER and the TARGET is -0.0148 Median value for loan that was not repaid = 272520.0000 Median value for loan that was repaid = 281490.7500
The correlation between AMT_INCOME_TOTAL_PER_ADULT_FAM_MEMBER and the TARGET is -0.0071 Median value for loan that was not repaid = 81000.0000 Median value for loan that was repaid = 83250.0000
In fact, none of these features has a visible effect on the TARGET variable.
group2_col = ['REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY',
'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_1',
'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG',
'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG',
'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG',
'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG',
'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'DAYS_LAST_PHONE_CHANGE', 'YEARS_LAST_PHONE_CHANGE',
'EXT_SOURCE_mean', 'EXT_SOURCES_MAX', 'EXT_SOURCES_SUM', 'EXT_SOURCES_MIN', 'EXT_SOURCES_MEDIAN']
for col in group2_col:
    kde_target(col, train)
The correlation between REG_CITY_NOT_LIVE_CITY and the TARGET is 0.0448 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between REG_CITY_NOT_WORK_CITY and the TARGET is 0.0515 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between LIVE_CITY_NOT_WORK_CITY and the TARGET is 0.0337 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between EXT_SOURCE_1 and the TARGET is -0.0960 Median value for loan that was not repaid = 0.5021 Median value for loan that was repaid = 0.5021
The correlation between EXT_SOURCE_2 and the TARGET is -0.1603 Median value for loan that was not repaid = 0.4276 Median value for loan that was repaid = 0.5657
The correlation between EXT_SOURCE_3 and the TARGET is -0.1525 Median value for loan that was not repaid = 0.5109 Median value for loan that was repaid = 0.5109
The correlation between APARTMENTS_AVG and the TARGET is -0.0109 Median value for loan that was not repaid = 0.1174 Median value for loan that was repaid = 0.1174
The correlation between BASEMENTAREA_AVG and the TARGET is -0.0050 Median value for loan that was not repaid = 0.0884 Median value for loan that was repaid = 0.0884
The correlation between YEARS_BEGINEXPLUATATION_AVG and the TARGET is -0.0268 Median value for loan that was not repaid = 0.9777 Median value for loan that was repaid = 0.9777
The correlation between YEARS_BUILD_AVG and the TARGET is -0.0088 Median value for loan that was not repaid = 0.7525 Median value for loan that was repaid = 0.7525
The correlation between COMMONAREA_AVG and the TARGET is 0.0023 Median value for loan that was not repaid = 0.0446 Median value for loan that was repaid = 0.0446
The correlation between ELEVATORS_AVG and the TARGET is -0.0118 Median value for loan that was not repaid = 0.0789 Median value for loan that was repaid = 0.0789
The correlation between ENTRANCES_AVG and the TARGET is -0.0070 Median value for loan that was not repaid = 0.1497 Median value for loan that was repaid = 0.1497
The correlation between FLOORSMAX_AVG and the TARGET is -0.0221 Median value for loan that was not repaid = 0.2263 Median value for loan that was repaid = 0.2263
The correlation between FLOORSMIN_AVG and the TARGET is -0.0131 Median value for loan that was not repaid = 0.2319 Median value for loan that was repaid = 0.2319
The correlation between LANDAREA_AVG and the TARGET is -0.0060 Median value for loan that was not repaid = 0.0663 Median value for loan that was repaid = 0.0663
The correlation between LIVINGAPARTMENTS_AVG and the TARGET is -0.0053 Median value for loan that was not repaid = 0.1008 Median value for loan that was repaid = 0.1008
The correlation between LIVINGAREA_AVG and the TARGET is -0.0132 Median value for loan that was not repaid = 0.1074 Median value for loan that was repaid = 0.1074
The correlation between NONLIVINGAPARTMENTS_AVG and the TARGET is 0.0166 Median value for loan that was not repaid = 0.0088 Median value for loan that was repaid = 0.0088
The correlation between NONLIVINGAREA_AVG and the TARGET is 0.0059 Median value for loan that was not repaid = 0.0284 Median value for loan that was repaid = 0.0284
The correlation between DAYS_LAST_PHONE_CHANGE and the TARGET is 0.0571 Median value for loan that was not repaid = -562.0000 Median value for loan that was repaid = -739.0000
The correlation between YEARS_LAST_PHONE_CHANGE and the TARGET is -0.0571 Median value for loan that was not repaid = 1.5397 Median value for loan that was repaid = 2.0247
The correlation between EXT_SOURCE_mean and the TARGET is -0.2179 Median value for loan that was not repaid = 0.4324 Median value for loan that was repaid = 0.5244
The correlation between EXT_SOURCES_MAX and the TARGET is -0.1737 Median value for loan that was not repaid = 0.5406 Median value for loan that was repaid = 0.6523
The correlation between EXT_SOURCES_SUM and the TARGET is -0.2179 Median value for loan that was not repaid = 1.2971 Median value for loan that was repaid = 1.5731
The correlation between EXT_SOURCES_MIN and the TARGET is -0.1892 Median value for loan that was not repaid = 0.2590 Median value for loan that was repaid = 0.4189
The correlation between EXT_SOURCES_MEDIAN and the TARGET is -0.1868 Median value for loan that was not repaid = 0.5021 Median value for loan that was repaid = 0.5109
The KDE plots above show the probability density of each continuous variable for the two TARGET classes, together with its correlation with TARGET.
The features with the most impact are those related to the external sources (EXT_SOURCE), judging by their distributions by TARGET and a correlation of about -0.22 (EXT_SOURCE_mean), followed by the client's address features.
Although the correlation between the address features (REG_CITY_NOT_LIVE_CITY, REG_CITY_NOT_WORK_CITY, LIVE_CITY_NOT_WORK_CITY) and TARGET is only slightly higher than for other variables, we consider it important whether the addresses a client gives in the application match each other.
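EXT_SOURCE_mean and EXT_SOURCES_SUM show the same correlation (-0.2179) in the output above. That is what one would expect if the three scores had no missing values when they were aggregated, because the sum is then exactly three times the mean. A sketch on randomly generated, hypothetical scores:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical external scores in [0, 1) with no missing values.
ext = pd.DataFrame(rng.random((200, 3)),
                   columns=["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"])

# Hypothetical target: lower scores make a failure more likely.
noise = rng.normal(0, 0.3, 200)
target = ((ext.mean(axis=1) + noise) < 0.5).astype(int)

# Row-wise aggregates, analogous to the engineered features.
ext_mean = ext.mean(axis=1)
ext_sum = ext.sum(axis=1)

corr_mean = target.corr(ext_mean)
corr_sum = target.corr(ext_sum)
print(corr_mean, corr_sum)
```

Since the two aggregates are perfectly correlated in that case, keeping both would add no information for a model.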
group3_col = ['FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_8',
'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR',
'PREVIOUS_LOANS_COUNT', 'PREV_BUR_MEAN_DAYS_CREDIT',
'PREV_BUR_MEAN_CREDIT_DAY_OVERDUE', 'PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE',
'PREV_BUR_MEAN_DAYS_ENDDATE_FACT',
'PREV_BUR_MEAN_AMT_CREDIT_MAX_OVERDUE',
'PREV_BUR_MEAN_CNT_CREDIT_PROLONG', 'PREV_BUR_MEAN_AMT_CREDIT_SUM',
'PREV_BUR_MEAN_AMT_CREDIT_SUM_DEBT',
'PREV_BUR_MEAN_AMT_CREDIT_SUM_LIMIT',
'PREV_BUR_MEAN_AMT_CREDIT_SUM_OVERDUE',
'PREV_BUR_MEAN_DAYS_CREDIT_UPDATE', 'PREV_BUR_MEAN_AMT_ANNUITY',
'PREV_BUR_MEAN_BUR_BAL_MEAN_MONTHS_BALANCE',
'CREDIT_TO_ANNUITY_RATIO']
for col in group3_col:
    kde_target(col, train)
The correlation between FLAG_DOCUMENT_3 and the TARGET is 0.0437 Median value for loan that was not repaid = 1.0000 Median value for loan that was repaid = 1.0000
The correlation between FLAG_DOCUMENT_6 and the TARGET is -0.0310 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between FLAG_DOCUMENT_8 and the TARGET is -0.0051 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between AMT_REQ_CREDIT_BUREAU_HOUR and the TARGET is -0.0024 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between AMT_REQ_CREDIT_BUREAU_DAY and the TARGET is 0.0382 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between AMT_REQ_CREDIT_BUREAU_WEEK and the TARGET is 0.0382 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between AMT_REQ_CREDIT_BUREAU_MON and the TARGET is 0.0014 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between AMT_REQ_CREDIT_BUREAU_QRT and the TARGET is -0.0045 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between AMT_REQ_CREDIT_BUREAU_YEAR and the TARGET is 0.0146 Median value for loan that was not repaid = 1.9000 Median value for loan that was repaid = 1.9000
The correlation between PREVIOUS_LOANS_COUNT and the TARGET is -0.0016 Median value for loan that was not repaid = 5.5612 Median value for loan that was repaid = 5.0000
The correlation between PREV_BUR_MEAN_DAYS_CREDIT and the TARGET is -0.0792 Median value for loan that was not repaid = 2.8261 Median value for loan that was repaid = 2.9673
The correlation between PREV_BUR_MEAN_CREDIT_DAY_OVERDUE and the TARGET is 0.0373 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE and the TARGET is -0.0510 Median value for loan that was not repaid = -0.9589 Median value for loan that was repaid = 0.0301
The correlation between PREV_BUR_MEAN_DAYS_ENDDATE_FACT and the TARGET is -0.0430 Median value for loan that was not repaid = 2.6584 Median value for loan that was repaid = 2.6584
The correlation between PREV_BUR_MEAN_AMT_CREDIT_MAX_OVERDUE and the TARGET is 0.0378 Median value for loan that was not repaid = 5242.4250 Median value for loan that was repaid = 4653.6675
The correlation between PREV_BUR_MEAN_CNT_CREDIT_PROLONG and the TARGET is 0.0362 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between PREV_BUR_MEAN_AMT_CREDIT_SUM and the TARGET is -0.0131 Median value for loan that was not repaid = 241045.9774 Median value for loan that was repaid = 226133.6805
The correlation between PREV_BUR_MEAN_AMT_CREDIT_SUM_DEBT and the TARGET is 0.0281 Median value for loan that was not repaid = 109510.6500 Median value for loan that was repaid = 68289.0525
The correlation between PREV_BUR_MEAN_AMT_CREDIT_SUM_LIMIT and the TARGET is 0.0070 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between PREV_BUR_MEAN_AMT_CREDIT_SUM_OVERDUE and the TARGET is 0.0368 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between PREV_BUR_MEAN_DAYS_CREDIT_UPDATE and the TARGET is 0.0652 Median value for loan that was not repaid = -498.6190 Median value for loan that was repaid = -546.6325
The correlation between PREV_BUR_MEAN_AMT_ANNUITY and the TARGET is 0.0051 Median value for loan that was not repaid = 16052.2473 Median value for loan that was repaid = 16052.2473
The correlation between PREV_BUR_MEAN_BUR_BAL_MEAN_MONTHS_BALANCE and the TARGET is 0.0429 Median value for loan that was not repaid = -20.9848 Median value for loan that was repaid = -20.9848
The correlation between CREDIT_TO_ANNUITY_RATIO and the TARGET is -0.0010 Median value for loan that was not repaid = 0.0476 Median value for loan that was repaid = 0.0506
From the KDE plots, we can see that four columns (PREV_BUR_MEAN_AMT_ANNUITY, PREV_BUR_MEAN_DAYS_CREDIT_UPDATE, PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE and PREV_BUR_MEAN_DAYS_CREDIT) slightly affect the target.
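With this many columns, it can help to rank features by the absolute value of their correlation with TARGET instead of reading the printed lines one by one. The sketch below does this on a hypothetical frame; FEATURE_A and FEATURE_B are made-up names, not columns of the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
target = rng.integers(0, 2, n)

# Hypothetical features: one tied to the target, one pure noise.
toy = pd.DataFrame({
    "TARGET": target,
    "FEATURE_A": target + rng.normal(0, 0.5, n),
    "FEATURE_B": rng.normal(0, 1, n),
})

# Pearson correlation of every feature with the target, strongest first.
corrs = toy.drop(columns="TARGET").corrwith(toy["TARGET"])
ranking = corrs.abs().sort_values(ascending=False)
print(ranking)
```

Applied to the real data, the same pattern could be restricted to a column group, for example `train[group3_col].corrwith(train["TARGET"])`.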
# Train dataset after creating more new features from column 83
group4_train = train.iloc[:, 82:]
# Features which belonged to 'previous_application' dataset
train_83_TARGET = group4_train.columns[:10]
train_83_TARGET = list(train_83_TARGET)
for col in train_83_TARGET:
    kde_target(col, train)
The correlation between PREVIOUS_APPLICATION_COUNT and the TARGET is 0.0079 Median value for loan that was not repaid = 4.0000 Median value for loan that was repaid = 4.0000
The correlation between PREV_APPL_MEAN_AMT_ANNUITY and the TARGET is -0.0328 Median value for loan that was not repaid = 11147.1591 Median value for loan that was repaid = 12005.3475
The correlation between PREV_APPL_MEAN_AMT_CREDIT and the TARGET is -0.0127 Median value for loan that was not repaid = 108262.8750 Median value for loan that was repaid = 115857.0000
The correlation between PREV_APPL_MEAN_AMT_DOWN_PAYMENT and the TARGET is -0.0391 Median value for loan that was not repaid = 3084.0000 Median value for loan that was repaid = 4182.7500
The correlation between PREV_APPL_MEAN_HOUR_APPR_PROCESS_START and the TARGET is -0.0346 Median value for loan that was not repaid = 12.5000 Median value for loan that was repaid = 12.6616
The correlation between PREV_APPL_MEAN_NFLAG_LAST_APPL_IN_DAY and the TARGET is 0.0171 Median value for loan that was not repaid = 1.0000 Median value for loan that was repaid = 1.0000
The correlation between PREV_APPL_MEAN_RATE_DOWN_PAYMENT and the TARGET is -0.0320 Median value for loan that was not repaid = 0.0690 Median value for loan that was repaid = 0.0819
The correlation between PREV_APPL_MEAN_DAYS_DECISION and the TARGET is 0.0446 Median value for loan that was not repaid = -703.8295 Median value for loan that was repaid = -844.9380
The correlation between PREV_APPL_MEAN_SELLERPLACE_AREA and the TARGET is -0.0246 Median value for loan that was not repaid = 65.7000 Median value for loan that was repaid = 96.5000
The correlation between PREV_APPL_MEAN_CNT_PAYMENT and the TARGET is 0.0309 Median value for loan that was not repaid = 12.0000 Median value for loan that was repaid = 12.0000
We can see that 'PREV_APPL_MEAN_DAYS_DECISION' (the average number of days since the decision on previous applications was made) has a small effect on TARGET, because there are clear differences between the distributions for TARGET = 0 and TARGET = 1. The other features have no visible effect on TARGET.
train_83_TARGET = group4_train.columns[11:21]
train_83_TARGET = list(train_83_TARGET)
for col in train_83_TARGET:
    kde_target(col, train)
The correlation between PREV_APPL_MEAN_CARD_MEAN_MONTHS_BALANCE and the TARGET is 0.0177 Median value for loan that was not repaid = -16.0507 Median value for loan that was repaid = -16.0507
The correlation between PREV_APPL_MEAN_CARD_MEAN_AMT_CREDIT_LIMIT_ACTUAL and the TARGET is -0.0171 Median value for loan that was not repaid = 222543.3098 Median value for loan that was repaid = 222543.3098
The correlation between PREV_APPL_MEAN_CARD_MEAN_AMT_DRAWINGS_CURRENT and the TARGET is 0.0324 Median value for loan that was not repaid = 15300.1840 Median value for loan that was repaid = 15300.1840
The correlation between PREV_APPL_MEAN_CARD_MEAN_AMT_INST_MIN_REGULARITY and the TARGET is 0.0326 Median value for loan that was not repaid = 3738.6082 Median value for loan that was repaid = 3738.6082
The correlation between PREV_APPL_MEAN_CARD_MEAN_AMT_PAYMENT_TOTAL_CURRENT and the TARGET is 0.0119 Median value for loan that was not repaid = 11164.7479 Median value for loan that was repaid = 11164.7479
The correlation between PREV_APPL_MEAN_CARD_MEAN_AMT_TOTAL_RECEIVABLE and the TARGET is 0.0387 Median value for loan that was not repaid = 75101.9839 Median value for loan that was repaid = 75101.9839
The correlation between PREV_APPL_MEAN_CARD_MEAN_CNT_DRAWINGS_CURRENT and the TARGET is 0.0339 Median value for loan that was not repaid = 1.7047 Median value for loan that was repaid = 1.7047
The correlation between PREV_APPL_MEAN_CARD_MEAN_CNT_INSTALMENT_MATURE_CUM and the TARGET is -0.0017 Median value for loan that was not repaid = 7.9982 Median value for loan that was repaid = 7.9982
The correlation between PREV_APPL_MEAN_CARD_MEAN_SK_DPD and the TARGET is -0.0078 Median value for loan that was not repaid = 3.1423 Median value for loan that was repaid = 3.1423
The correlation between PREV_APPL_MEAN_CARD_MEAN_SK_DPD_DEF and the TARGET is -0.0106 Median value for loan that was not repaid = 0.0272 Median value for loan that was repaid = 0.0272
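These features have the same distribution in both groups (the medians for repaid and not-repaid loans are identical), and the frequency of 'TARGET = 0' is higher than that of 'TARGET = 1'. With all correlations below 0.04 in absolute value, they show no clear effect on TARGET.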
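Medians that match to four decimals in both groups can occur when a single value (for example, a value used to fill missing entries) makes up most of a column. Whether that is the case here is an assumption, not something the output confirms. A minimal sketch of a check, using a hypothetical helper `top_value_share` and toy data:

```python
import pandas as pd


def top_value_share(df, cols):
    """For each column, return its most frequent value and the share of rows holding it.

    Hypothetical helper, not part of the original notebook.
    """
    rows = []
    for col in cols:
        # Relative frequency of each distinct non-missing value, most common first
        counts = df[col].value_counts(normalize=True, dropna=True)
        rows.append({"feature": col,
                     "top_value": counts.index[0],
                     "share": counts.iloc[0]})
    return pd.DataFrame(rows).set_index("feature")


# Toy data: one column dominated by a single value, one spread out
toy = pd.DataFrame({
    "TARGET": [0, 0, 0, 0, 1, 1, 1, 1],
    "mostly_filled": [5.0, 5.0, 5.0, 1.0, 5.0, 5.0, 5.0, 9.0],
    "spread": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

shares = top_value_share(toy, ["mostly_filled", "spread"])
medians = toy.groupby("TARGET")[["mostly_filled", "spread"]].median()
print(shares)
print(medians)
```

If a column's `share` is above 0.5, its median equals that dominant value in any group where the value still holds a majority, so equal medians alone would say little about the rest of the distribution.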
# Installment-payment aggregates: columns 23 to 26 of group4_train
train_83_TARGET = list(group4_train.columns[23:27])
for col in train_83_TARGET:
    kde_target(col, train)
The correlation between PREV_APPL_MEAN_INSTALL_MEAN_DAYS_INSTALMENT and the TARGET is 0.0387 Median value for loan that was not repaid = -714.5000 Median value for loan that was repaid = -852.6171
The correlation between PREV_APPL_MEAN_INSTALL_MEAN_DAYS_ENTRY_PAYMENT and the TARGET is 0.0392 Median value for loan that was not repaid = -725.5103 Median value for loan that was repaid = -864.9751
The correlation between PREV_APPL_MEAN_INSTALL_MEAN_AMT_INSTALMENT and the TARGET is -0.0106 Median value for loan that was not repaid = 12517.0704 Median value for loan that was repaid = 13916.2594
The correlation between PREV_APPL_MEAN_INSTALL_MEAN_AMT_PAYMENT and the TARGET is -0.0149 Median value for loan that was not repaid = 11941.5994 Median value for loan that was repaid = 13674.9544
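We can see that both DAYS features, 'PREV_APPL_MEAN_INSTALL_MEAN_DAYS_INSTALMENT' and 'PREV_APPL_MEAN_INSTALL_MEAN_DAYS_ENTRY_PAYMENT', have a small effect on TARGET (correlations of 0.0387 and 0.0392, with clearly different medians between the two groups).
Other features show no clear effect on TARGET.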
# POS cash-balance aggregates: columns 27 to 29 of group4_train
train_83_TARGET = list(group4_train.columns[27:30])
for col in train_83_TARGET:
    kde_target(col, train)
The correlation between PREV_APPL_MEAN_POS_MEAN_MONTHS_BALANCE and the TARGET is 0.0349 Median value for loan that was not repaid = -25.0000 Median value for loan that was repaid = -29.0000
The correlation between PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT and the TARGET is 0.0276 Median value for loan that was not repaid = 12.0000 Median value for loan that was repaid = 12.0000
The correlation between PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT_FUTURE and the TARGET is 0.0352 Median value for loan that was not repaid = 7.5000 Median value for loan that was repaid = 7.1250
Most of these features show no clear effect on TARGET. Only 'PREV_APPL_MEAN_POS_MEAN_MONTHS_BALANCE' (the average month of balance) has a small effect: its median is -25 for loans that were not repaid versus -29 for loans that were repaid, so the two TARGET groups differ visibly.
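The positional slices used above, such as `group4_train.columns[27:30]`, depend on the column order and select the wrong features if a column is added or dropped earlier in the pipeline. A minimal sketch of a name-based alternative follows; the helper `columns_with_prefix` is hypothetical, and the toy frame only reuses column names that appear in the output above.

```python
import pandas as pd


def columns_with_prefix(df, prefix, exclude=()):
    """Return the columns whose names start with `prefix`, in their original order.

    Hypothetical helper, not part of the original notebook.
    """
    return [c for c in df.columns if c.startswith(prefix) and c not in exclude]


# Toy frame holding a few of the column names seen above (no rows needed)
toy = pd.DataFrame(columns=[
    "SK_ID_CURR",
    "PREV_APPL_MEAN_INSTALL_MEAN_AMT_PAYMENT",
    "PREV_APPL_MEAN_POS_MEAN_MONTHS_BALANCE",
    "PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT",
    "PREV_APPL_MEAN_POS_MEAN_CNT_INSTALMENT_FUTURE",
])

pos_cols = columns_with_prefix(toy, "PREV_APPL_MEAN_POS_MEAN_")
print(pos_cols)
```

On the real frame, the same prefix would also match engineered columns such as 'PREV_APPL_MEAN_POS_MEAN_PCTG_INSTALMENT_FUTURE_INSTALMENT'; the `exclude` argument is there to leave those out when they are plotted separately.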
# Engineered difference / percentage features
train_83_TARGET = ['PREV_APPL_MEAN_DIFF_AMT_DOWN_PAYMENT_AMT_ANNUITY',
                   'PREV_APPL_MEAN_CARD_MEAN_PCTG_RECEIVABLE_TOTAL_CURRENT',
                   'PREV_APPL_MEAN_INSTALL_MEAN_DIFF_PAYMENT',
                   'PREV_APPL_MEAN_INSTALL_MEAN_DAYS_BEFORE_DUE',
                   'PREV_APPL_MEAN_POS_MEAN_PCTG_INSTALMENT_FUTURE_INSTALMENT']
for col in train_83_TARGET:
    kde_target(col, train)
The correlation between PREV_APPL_MEAN_DIFF_AMT_DOWN_PAYMENT_AMT_ANNUITY and the TARGET is 0.0026 Median value for loan that was not repaid = 7302.4538 Median value for loan that was repaid = 7403.4050
The correlation between PREV_APPL_MEAN_CARD_MEAN_PCTG_RECEIVABLE_TOTAL_CURRENT and the TARGET is 0.0403 Median value for loan that was not repaid = 63937.2360 Median value for loan that was repaid = 63937.2360
The correlation between PREV_APPL_MEAN_INSTALL_MEAN_DIFF_PAYMENT and the TARGET is 0.0260 Median value for loan that was not repaid = 0.0000 Median value for loan that was repaid = 0.0000
The correlation between PREV_APPL_MEAN_INSTALL_MEAN_DAYS_BEFORE_DUE and the TARGET is -0.0230 Median value for loan that was not repaid = 10.5338 Median value for loan that was repaid = 11.7558
The correlation between PREV_APPL_MEAN_POS_MEAN_PCTG_INSTALMENT_FUTURE_INSTALMENT and the TARGET is 0.0294 Median value for loan that was not repaid = 0.6250 Median value for loan that was repaid = 0.6054
In general, none of these features shows a clear effect on TARGET.